Riak Big Sets, CRDTs, Adding SQL to NoSQL, Scaling your application, and more

The Riak team of distributed systems engineers continues to develop new Riak functionality to help companies build innovative applications. A few members of the Riak team recently took a break from their day jobs to share some of the details of their recent work at the Erlang User Conference in Stockholm. In case you missed them, we wanted to share videos of these presentations. (Hint: They aren’t just for people passionate about Erlang.)

Russell Brown provided an update on the research and implementation of CRDTs and his recent work on expanding CRDT support for bigger sets in Riak. Gordon Guthrie showed how NoSQL really means “not only SQL” and explained the new capabilities and techniques in Riak that are specifically optimized for time series data. Torben Hoffmann gave some hints on application architectures that have the flexibility to serve both the short- and long-term needs of a start-up. Andy Till explained how to debug Erlang, Elixir, and LFE applications using tracing with Erlyberly. Plus, Magnus Kessler led a hands-on tutorial on Riak Distributed Data Types (CRDTs).

See more information on each of these presentations below.

Grab your popcorn and enjoy!

Big(ger) Sets: Making CRDT Sets Scale in Riak

Russell Brown
Eventually Consistent CRDTist at Riak
GitHub: russelldb

Presentation overview:

This talk looks at the original implementation of Riak Distributed Data Types (CRDTs) and shows a new approach to designing CRDTs in Riak from the ground up that comes with a great deal more scale and performance.

Talk objective:

Illustrate the engineering challenges inherent in turning research papers into a real-world product. This is both a cautionary tale and a showcase of our recent work.

From NoSQL to More SQL – Adding Structure and Queriability to Riak

Gordon Guthrie
Senior Engineer @ Riak and Serial Entrepreneur
GitHub: gordonguthrie  Twitter: @gordonguthrie

Presentation overview:

Riak is an industry leader in the NoSQL space, but with the new Riak Time Series offering, more traditional tools like native SQL querying are being added. This talk will look at meta-programming being used in a gossip-based cluster to build an adaptive Riak that reconfigures itself to handle structured data – and how we use standard SQL to interrogate that data.

Talk objective:

Explain the new capabilities and techniques in Riak TS.


Winning as a Start-Up by Failing Fast

Torben Hoffmann
Chief Architect at Riak
GitHub: LeHoff   Twitter: @LeHoff

Presentation Overview:

Scalability is a word often used to describe Riak KV & Erlang/Elixir, but it is not the first concern for a start-up. Scalability is a rich man’s problem.

Sure, you need a stack that can scale… when you are ready!

Until that point, you need something that is flexible and allows you to iterate over a lot of experiments in a short period of time.

Experiments with software often involve errors. Erlang/Elixir has a unique approach to dealing with errors that lends itself well to running these experiments while, at the same time, keeping a start-up rolling. We will look into how you should architect your software to leverage this, so you work with the BEAM and not against it.

Riak KV is a scalable, reliable NoSQL database that takes the Erlang philosophy regarding failures to heart – don’t ignore failure, embrace failure!!

But wait a second… if scalability is a rich man’s problem, what role does Riak KV play for a start-up?

This talk will go into how to approach this dilemma by attacking it with an architecture that has the flexibility that serves both the short and long term needs of a start-up.

Talk objective:

Show how Riak KV and Erlang/Elixir can help a start-up focus on the most important thing: conducting experiments fast to get to a viable business model before the money runs out.

Trace Debugging With Erlyberly

Andy Till
Riak Time Series Database Developer
GitHub: andytill  Twitter: @andy_till

Presentation overview:

The BEAM virtual machine has flexible and powerful tooling for introspection, statistics, and debugging without affecting the running application. Erlyberly is an ongoing project, focused on tracing, that lowers the barrier to entry for using these capabilities.

Talk objective:

Learn how to debug Erlang, Elixir, and LFE applications using tracing with erlyberly.

Tutorial: No more fighting with your siblings. Riak Distributed Data Types (CRDTs) remove the stress

Slides (no video available)

Magnus Kessler
Client Services Engineer @ Riak

CRDTs in Riak

Tutorial Overview:
Choosing a highly available database can mean sacrificing some consistency of data during failure scenarios, but it should not mean data loss. Databases designed with partition tolerance and eventual consistency in mind can offer multiple ways to handle conflict resolution, but some can be difficult to reason about or so simple that you lose the configurability you need.

In Riak KV, we recommend you allow ‘siblings’ or multiple versions of data to be stored whenever there is no way to determine the correct latest version. But once you have multiple siblings, how do you get back to the single correct and consolidated version of data your application is expecting to use?

Riak Data Types are distributed data structures designed to provide deterministic resolution logic, removing the need for the application developer to write such functionality in an ad-hoc manner or multiple varying ways across a large project.

Summary:

We will demonstrate why Riak KV would generate siblings and then implement each of the Riak Data Types available in Riak KV to solve the challenges of working with a highly available and eventually consistent database.

Next Steps

If you found these presentations interesting, you might also enjoy these blogs:

Running Riak in Docker
Riak’s Spark-Riak Connector 1.6.0 Now Available
NoSQL Riak TS Gets JDBC Driver Inspired by SQL

3-part series by Damien Krotkine and Ivan Paponov at Booking.com

Using Riak as Events Storage – Part 1
Using Riak as Events Storage – Part 2
Using Riak as Events Storage – Part 3

Brussels is famous for beer, chocolate, mussels and now the Spark Summit 2016

Riak is excited to continue to be part of the premier big data event for the Apache® Spark™ Community by participating at the upcoming Spark Summit being held October 25-27th. Thousands of developers, scientists, analysts, researchers, and executives have attended this Databricks-organized series of events, and we’re looking forward to seeing everyone at the Square Brussels for the European leg of the tour.

At the Spark Summit, Riak will be highlighting our recently launched Riak Apache Spark Connector. Riak users now have an integrated, scalable solution for powerful and fast Big Data analytics, while Spark users now have the ability to simply utilize a resilient, highly available datastore. It brings the full power of Apache Spark to the operational data managed in Riak distributed clusters. Apache Spark and Riak share common design principles: high performance, scalability, resiliency, and operational simplicity.

John Musser, our Vice President of Engineering, is out from Seattle and will be presenting “Lessons Learned Optimizing NoSQL for Apache Spark” at the Summit on Wednesday, October 26th at 12:10 in the Silver Hall. Not able to attend? Sign up for our blog updates (scroll down to the bottom of this page and a pop-up will ask you for your details) and we’ll let you know when his slides are available.

If you would like to get more familiar with our Apache Spark integration, we recently published a blog explaining our approach, titled The Quest for Hidden Treasure, and more recently a post on the latest release, Riak’s Spark-Riak Connector 1.6.0. We have also published a notebook tutorial on Databricks.com.

So drop by Booth #P6 and say hi to the Riak team. We have our Scalextric demo on hand and the fastest lap will win! Ask us for some treasure and we’ll find someone to talk your ear off about Riak and the Spark Connector while you grab yourself a sticker or two.

 

How IoT is making distributed computing cool again – by Adam Wray

Ten things that need sorting out before the IoT takes off

Exclusive interview with Dave McCrory our CTO

Don’t miss Jessica Twentyman’s exclusive interview with Dave McCrory where he discusses his unique concept of Data Gravity along with why NoSQL databases are an important part of the technology stack for all businesses dealing with the need to manage scale and active data workloads.

If you’d like to see Dave talk about this in person, he will be one of the headlining speakers at the upcoming Big Data London event on 3-4th November at Olympia.

The conference and exhibition are free to attend; you just need to register.

Make sure you come by our stand as we’ll be demonstrating Riak TS as we track lap times on our Scalextric demo – the person with the fastest lap time will win their own Scalextric set to take home!

 

Welcome to the (IoT) Jungle

It’s no secret that we’re hot on IoT education here at Riak. We’re in the middle of our global IoT Roadshow, and earlier this year open sourced our Riak TS database, specifically designed to tackle the needs of IoT applications. However, even we didn’t think we’d be experiencing the heat of an IoT jungle. But that’s exactly what happened in London last week, as we held a roundtable discussion with some of the sharpest minds in the IoT industry at the Soho Blanchette’s ‘Jungle Room’.

Jungle IoT Roundtable

The session was chaired by Alexandra Deschamps-Sonsino, the founder of the connected Good Night Lamp and director of IoT consultancy designswarm. Also in attendance were Riak’s MD EMEA, Manu Marchal; Actual Experience’s CEO, Dave Page; and Intellicore’s iCTO, Declan Caulfield. Alongside a host of industry analysts and journalists, this panel discussed the current state of play in IoT, the challenges facing the space, as well as the viable use cases we’re seeing.

In terms of the challenges currently facing IoT, Manu Marchal argued that scale is a big issue today, and an even bigger issue for the future. Gartner predicts there will be 21 billion IoT devices in operation by 2020. Imagine each one of these devices receiving and transmitting data in real time; the scale is unimaginable. Unless technologies are designed to handle time series data in real time and at large scale, failure is inevitable.

Declan Caulfield of Intellicore argued that on the devices/sensors side of things, there are a number of challenges which need to be overcome in order for IoT to thrive. Power consumption is going to be a key factor to take into account. There has been a lot of innovation in the battery industry to facilitate IoT but there is more work to be done. Sensors also need to be produced in a recyclable fashion to prevent environmental damage, and miniaturisation of technology will need to be another focus as we strive to make sensors and devices smaller than ever before.

Open source technology emerged as one of the IoT champions, with its ability to facilitate some of the most innovative development work in recent years. The interoperable nature of the technology is a particular attribute which makes it perfectly suited for IoT: with billions of different devices, compatibility is going to be key. In terms of businesses undergoing digital transformation, Dave Page confirmed that Actual Experience is seeing this everywhere it turns.

The room of experts was unanimous that the real value of IoT comes from the data, not the devices. We’re generating and analysing more data than ever, and this trend is only increasing. John Leonard of Computing raised the issue of data ownership. For example, if you live in a smart city collecting huge volumes of data on its citizens, which is likely to be managed by a large traditional IT organisation, what can they do with that data? Will it be anonymised, will encryption be involved, and will the government have the right to access it? We don’t have the answers to these questions yet, but Manu Marchal closed the discussion by stating that he believes he’ll be living his life very differently over the next five years thanks to IoT. He’s personally looking forward to self-driving cars arriving sooner rather than later.

To discover your IoT IQ take our time series quiz today or to find out more about time series databases read our quick and informative cheat sheet.

 

Running Riak in Docker

Unless you’ve been living under a rock for the last couple of years (and believe me, given what’s happening in the world today, I ain’t gonna judge) you know that Docker is building an Empire in the World of Containers. It’s permeating DevOps and infrastructure, microservices, financial services, healthcare, and just about anywhere that containerized applications make sense. Although it might one day power a smart IoT application that helps beat cancer, by itself it is no operational panacea. It can do more harm than good if wielded irresponsibly. In this blog post, I’ll lay out some groundwork for running a Riak cluster in Docker. Expect to see:

  • How to run single nodes as well as multiple node clusters.
  • How to test the container.
  • How to run applications that connect to Riak.
  • How to build your own Docker container with Riak installed.

Getting Started

To run Riak in Docker you need a relatively recent version of the daemon. Everything should probably work on 1.11, though for the purposes of this article I’ll assume you’re using a recent version of 1.12. Swarm mode is outside the scope of this post but it’s there if you want to experiment on your own.

You don’t need anything special to run Riak in Docker beyond just Docker. It seems a little anti-climactic, but here’s all you need to run a single node of Riak KV using a Ubuntu Trusty base image:

docker run --name=riak -d -p 8087:8087 -p 8098:8098 basho/riak-kv

Docker will download the image from Docker Hub (KV, TS) and start a single node.

You should then be able to start any client in your preferred language and connect to either localhost:8087 or the IP address of your box (or VM, if you happen to be running the above inside a virtual machine).

Similarly for Riak TS (Time Series):

docker run --name=riak -d -p 8087:8087 -p 8098:8098 basho/riak-ts

NOTE: Since both these example commands use the container name “riak” and the standard ports, you’ll have to stop the first container before starting the second. docker rm -f riak should do the trick.

In this example, we’re mapping the ports to their standard values using port mapping. If you want to use randomly-assigned ports that you can discover later (because you want to run multiple containers on the same host), just remove the -p options and replace them with a single -P:

docker run --name=riak -d -P basho/riak-ts

Connecting to a Riak Node

NOTE: This entire section assumes the use of Docker’s bridge networking. Using host networking will involve additional considerations and will be the topic of a different post.

In order to connect to a Riak node running in Docker, you need to know what IP address to use. That will vary depending on the network settings you’re using for that container. In the default bridge configuration, you can access Riak via the internal Docker IP address (probably 172.17. or similar) and use the default port of 8087, *regardless* of what you have set in the port mappings. If you access Riak via localhost, however, you can *only* use the mapped ports (in our example: 8087 and 8098).

TIP: If you’re running an application in another Docker container and that container has access to the Docker subnet your Riak container is running in, you should have no problems. Before confusing everything too much with custom configurations and multiple subnets, try running your Riak nodes using the Docker defaults–at least until you’re comfortable with the peculiarities of running a clustered database with complex networking needs inside Docker.

To discover the IP address and port combinations needed to connect to Riak, use docker inspect. In general, you only need to discover one of the two parts of the HOST:PORT pair: either the HOST or the PORT. If you use the Docker internal IPs, then you can use the standard Riak ports of 8087 and 8098 for Protobuf and HTTP, respectively. If you use any other IP (like the IP address of your box or VM), then you’ll need to discover the PORT values and use a pre-determined IP address value. The two variations can be supported by using docker inspect and specifying a Go template string in the -f flag to filter the JSON output to only show the values we care about.

To discover the port

Assuming all your containers are running on a single host and you’ll reuse the IP address (for the purposes of this example: localhost), you only need to discover what the port mappings are. The following Go template expression should spit out the protobuf ports (replace 8087 with 8098 to get the HTTP ports):

docker inspect -f 'localhost:{{(index (index .NetworkSettings.Ports "8087/tcp") 0).HostPort}}' riak

This will print a list of the mapped ports, one host:port line per container. In this example, we’re only specifying the riak container which we started earlier. If we wanted to inspect an entire cluster, we’d have to list all the containers of the cluster.

docker inspect -f 'localhost:{{(index (index .NetworkSettings.Ports "8087/tcp") 0).HostPort}}' riak-1 riak-2 riak-3 riak-4 riak-5

If we pipe this output through tr, we can create a comma-separated list of HOST:PORT pairs suitable for passing to the various Riak client libraries. They each have their own way of specifying a list of nodes to connect to, so YMMV. At a minimum, you’ll want to translate the newlines to commas and maybe set an environment variable.

export RIAK_HOSTS=$(docker inspect -f 'localhost:{{(index (index .NetworkSettings.Ports "8087/tcp") 0).HostPort}}' riak | tr '\n' ',' | sed 's/,$//')

NOTE: The tr is to pull the separate lines into a single line, separating them by commas, and the sed is to strip the final comma off the end since it might be a little more awkward for some logic to deal with an empty string when it’s expecting HOST:PORT.

To discover the IP

This will work on just about any Linux distribution or Mac OS X using docker-machine with an appropriate route set up. This will *not* work on the Mac OS X native Docker beta since it’s not currently possible to route traffic from your Mac across the internal xhyve VM running the Docker daemon to the 172.17. addresses that Docker uses. For more information, read the GitHub issue on this topic.

export RIAK_HOSTS=$(docker inspect -f '{{.NetworkSettings.IPAddress}}:8087' riak | tr '\n' ',' | sed 's/,$//')

This will give you something like 172.17.0.3:8087. Pass this value to the client library of your choice (assuming where you’re running this client has access to the 172.17. subnet, as just discussed). That’s easiest to do if your client code is _also_ running inside a Docker container. If you’re using the native Mac beta of Docker, this is also the only way to access those 172.17. addresses reported by docker inspect.

Clustering

If you already have the infrastructure for creating Riak clusters, then you could likely reuse it with Docker by replacing direct calls to riak and riak-admin with docker exec $CONTAINER riak|riak-admin. Another option for taking advantage of Riak’s clustering is to create the cluster manually using docker exec and riak-admin. Just follow the excellent documentation on creating a cluster and prepend all the riak-admin commands with docker exec in the appropriate container. Since this is a manual process and will be blown away when you restart the node, you should really only consider this approach appropriate for ad hoc testing and custom automation; it’s also beyond the scope of this post.
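
For example, a minimal sketch of joining a second containerized node to the first by hand (the container name and node address below are placeholders) could look like this:

# Sketch only: substitute your own container name and the node name
# (riak@<ip>) of the node you want to join to.
docker exec riak-2 riak-admin cluster join riak@172.17.0.2
docker exec riak-2 riak-admin cluster plan
docker exec riak-2 riak-admin cluster commit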

Unless you’re building your own Docker image and intentionally excluding Riak Explorer for a specific reason, you can take advantage of the simple cluster bootstrapping functionality that’s baked into the Docker image. Riak Explorer is used for this because its clustering operation combines the node add and cluster commit operations into a single REST call. This bootstrapping is activated when the value of the COORDINATOR_NODE environment variable passed to docker run is the IP address of the first node in a cluster.

Starting a Dockerized Cluster

In order to start a Dockerized cluster, you must first start a coordinator node. This is the first node in a cluster and the one which subsequent nodes will join to in order to create the cluster. In these examples, we’ll start the nodes manually with docker run to illustrate the steps. Afterward, we’ll create a docker-compose.yml file to encapsulate this functionality into an easily-digestible form.

The following starts a coordinator node, mapping the ports to their default settings. This will be the primary node we interact with and the one we pass the IP address to when we start other nodes.

docker run --name=riak -d -p 8087:8087 -p 8098:8098 --label cluster.name=adhoc basho/riak-kv

NOTE: The ability to tag containers with arbitrary labels is a very powerful–and sometimes overlooked–feature of Docker. Whenever you start a container for a Riak cluster, it will make your life easier to tag that container with a label to make them easy to find and manipulate later.

We can now discover the IP address we’ll need to use as the value of COORDINATOR_NODE by using docker inspect:

$ docker inspect -f '{{.NetworkSettings.IPAddress}}' riak
172.17.0.3

Whenever we start the other containers in this cluster, we’ll just pass -e COORDINATOR_NODE=172.17.0.3 and the cluster will be auto-created.

docker run -d -P -e COORDINATOR_NODE=172.17.0.3 --label cluster.name=adhoc basho/riak-kv

TIP: Instead of using a hard-coded IP address, you can replace it with a shell expression like $(docker inspect -f '{{.NetworkSettings.IPAddress}}' riak) to facilitate automating these steps. Don’t forget to parameterize the name of the container being passed to the coordinator node (the value of --name in the docker run command).

Some notable differences between this command and the one we used to start the coordinator node are:

  • No --name specified. We likely won’t be referring to this individual node by name; instead, we’ll find its container ID using docker ps and filtering on the label.
  • No specific port mapping. Running multiple nodes on the same host means each container will have to have the standard ports mapped to available ones. It’s easiest to let Docker handle that and randomly assign the mappings. We’ll look these values up later with docker inspect anyway.
  • Addition of the COORDINATOR_NODE environment variable. The bootstrapping code will use this IP address to join to when starting the container.

Simply repeat the above command once for each node you want to start.
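
For example, a small shell loop (a sketch only, reusing the container name, image, and label from the commands above) starts four members for a five-node cluster:

# Look up the coordinator's IP from the container started above, then start the members.
COORDINATOR_IP=$(docker inspect -f '{{.NetworkSettings.IPAddress}}' riak)
for i in 1 2 3 4; do
  docker run -d -P -e COORDINATOR_NODE=$COORDINATOR_IP --label cluster.name=adhoc basho/riak-kv
done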

Using Riak Explorer

Riak Explorer comes bundled with Riak in the standard Docker image. It provides a comprehensive HTTP API that adds functionality not available in the standard Riak HTTP API. If you started the coordinator node using a command similar to the one shown above, you should be able to open the Riak Explorer UI in a web browser by navigating to http://localhost:8098/admin/. If you’re using Linux, you can alternatively use the Docker IP passed as the COORDINATOR_NODE value (from the example above: http://172.17.0.3:8098/admin/).

Operational Info

Besides providing a nice web GUI for interacting with data in Riak, Explorer provides some nice graphs on resource usage for the cluster, as well as providing information about the nodes in the cluster. Pull up the Ops tab, where you can select “Individual Node Details” to see the list of nodes in the cluster and have links provided to view statistics, log files, and current configuration values for each node.

I’ll let you two get acquainted. Spend as much time as you like.

Using docker-compose

It’s fairly easy to encapsulate the necessary parameters to docker run to create a Riak cluster by using docker-compose. First, download one (or both) of the following example docker-compose.yml files and save them to your local disk.

NOTE: If you name them anything other than docker-compose.yml, remember that you’ll have to add the -f myfile.yml option to docker-compose every time you run the command.

There are two services defined in this docker-compose.yml file: a coordinator node and a member node. The intent is to use the coordinator as the value for the COORDINATOR_NODE in subsequent member nodes to create the cluster. You will scale the service to 1 for coordinator and N for members (N = $CLUSTER_SIZE – 1).
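
The example files themselves aren’t reproduced here, so below is a rough sketch of what such a docker-compose.yml could contain, written as a shell heredoc in the same spirit as the schema example later in this post. The compose version, service names, label, and especially the use of the coordinator service’s hostname as the COORDINATOR_NODE value are assumptions; check them against the example files that accompany the original post.

cat <<EOF > docker-compose.yml
version: "2"
services:
  coordinator:
    image: basho/riak-kv
    ports:
      - "8087:8087"
      - "8098:8098"
    labels:
      - "cluster.name=compose"
  member:
    image: basho/riak-kv
    ports:
      - "8087"
      - "8098"
    labels:
      - "cluster.name=compose"
    environment:
      - COORDINATOR_NODE=coordinator
EOF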

To start a 5-node cluster using docker-compose, use the command scale:

docker-compose scale coordinator=1 member=4

The containers will start in the background. You can monitor their progress with the logs command.

docker-compose logs

When all the member containers have started, you should be able to execute commands like riak-admin cluster status on the coordinator and see that the member nodes have successfully joined the cluster.

$ docker-compose exec coordinator riak-admin cluster status

---- Cluster Status ----
Ring ready: true

+---------------------+------+-------+-----+-------+
|        node         |status| avail |ring |pending|
+---------------------+------+-------+-----+-------+
| (C) riak@172.17.0.2 |valid |  up   | 87.5|  50.0 |
|     riak@172.17.0.4 |valid |  up   |  0.0|   0.0 |
|     riak@172.17.0.5 |valid |  up   |  0.0|   0.0 |
|     riak@172.17.0.6 |valid |  up   |  0.0|   0.0 |
|     riak@172.17.0.7 |valid |  up   | 12.5|  50.0 |
+---------------------+------+-------+-----+-------+

When you’re ready to take down the cluster, just use docker-compose down.

A Word on Volumes

You can use volumes in docker-compose for /var/lib/riak and /var/log/riak, as well as for the schemas in /etc/riak/schemas. Compose seems to encourage using --volumes-from a specific container (which you’re certainly free to do). In order to use, say, a local directory, you’ll need to declare the volume as external in the YAML config.

Advanced Configuration

In order to customize the container bootstrapping process with custom configuration, you have several options:

  • Mount the needed configuration items into the container as a volume. e.g. Add -v $(pwd)/riak.conf:/etc/riak/riak.conf to the docker run command.
  • Augment the default riak.conf using sed like the built-in bootstrapping script.
  • Derive a new container from the standard container and copy in the config using COPY in a Dockerfile.
  • Log into the container interactively using docker exec -it, change the config, and manually restart the node (not really recommended except for one-off, ad hoc situations where you’re just experimenting since everything will be lost when the container stops).

Bucket Type Bootstrapping

The Riak Docker images contain specialized bootstrapping code to find files in the /etc/riak/schemas directory that end in .dt or .sql and use the contents of those files to create bucket types or time series tables, respectively.

Create Schema Files

If you want to bootstrap a KV datatype, create a file in the /etc/riak/schemas directory named bucket_name.dt. Replace bucket_name with the name you want to use for the bucket. Inside the file, include a single line that contains the datatype you want to use for this bucket.

For example, to create a bucket named “counter” for the CRDT datatype counter, create a file named counter.dt and put the text counter as the only content in the file.

echo "counter" >schemas/counter.dt

Mount the schemas into the container as a volume when running the container:

docker run -d -P -v $(pwd)/schemas:/etc/riak/schemas basho/riak-ts

If you pull up Riak Explorer, as described above, you should see a bucket type of “counter” listed in the DATA tab.

Create TS Tables

The process for creating time series tables is identical to that for bucket types, except the content of the file will be a CREATE TABLE command.

cat <<EOF > schemas/GeoCheckin.sql

CREATE TABLE GeoCheckin
(
   id           SINT64    NOT NULL,
   time         TIMESTAMP NOT NULL,
   region       VARCHAR   NOT NULL,
   state        VARCHAR   NOT NULL,
   weather      VARCHAR   NOT NULL,
   temperature  DOUBLE,
   PRIMARY KEY (
     (id, QUANTUM(time, 15, 'm')),
      id, time
   )
)

EOF

When you run the container using the above command, the bootstrapping code will create and activate your table and you can then start using it right away.

Volumes for Data and Logs

The Riak Docker image exposes several volumes you can use instead of leaving the nodes to be completely ephemeral and losing everything when the container is shut down. Attach volumes using the -v or --volumes-from switches when starting the container.

docker run -d -P -v $(pwd)/data:/var/lib/riak -v $(pwd)/logs:/var/log/riak basho/riak-kv

BYOC: Build Your Own Container

If you want the complete flexibility of building your own Docker container, clone the source of the repo basho-labs/riak-docker and follow the build instructions in the README.adoc (which is executable using asciibuild).

Appendix

Jon Brisbin <jbrisbin@riak.com>
@j_brisbin

Riak Advances NoSQL and IoT Technologies with Latest Versions of Industry-leading Riak Databases

Product Enhancements Empower Organizations to Transcend the Chaos of Big Data in Enterprise IT Environments

SEATTLE, Sept. 28, 2016 — Riak Technologies, the creator and developer of Riak® KV and Riak® TS, the world’s most resilient NoSQL databases, today announced the latest versions of its Riak TS and Riak KV databases. Riak also announced enhancements, including an updated Spark Connector and enhanced Mesos support. These collective updates reinforce Riak’s position as a leading NoSQL company that enables organizations to quickly create value from massive amounts of data, including streaming IoT sensor and device data, while relieving engineering and operational burdens with solutions that assure high availability and operational simplicity.

Riak NoSQL databases provide a lower total cost of ownership (TCO) than competitors, as well as the high scalability enterprises require. The improvements to Riak KV and Riak TS make it easier for application developers to build scalable, future-proof applications and allow fine-grained control over running operations to ensure peak performance. Building on work done with Cisco for MesosCon, Riak has now introduced an Erlang-based framework for Mesos to help its customers increase machine utilization efficiency.

Riak databases address some of the most pressing issues facing developers as they build big data and IoT applications capable of scaling to unprecedented size and withstanding the rigors of today’s modern data realities. A recent Riak-sponsored TechValidate survey of Riak users revealed the following findings:

  • Respondents were asked to list the greatest concerns facing developers when considering whether to start IoT projects. The majority cited application stability at scale as the greatest concern (see research chart).
  • Survey respondents agree on three major pain points developers face when managing applications that use their data. “Reliability,” “ease of operations” and “scalability” are the issues most respondents have in common.
  • More than 70% of those surveyed agreed that technology infrastructures required to support IoT initiatives will face greater scale challenges than big data initiatives of years past.
  • Nearly all respondents believe NoSQL solutions will be the databases of choice for supporting IoT applications.

“Riak continues to focus on a multi-model approach to help organizations simplify data management for big data applications,” said Adam Wray, CEO, Riak. “Companies must architect data initiatives so that they accelerate the creation and deployment of data pipelines to collect, process and analyze IoT and big data. Riak’s databases are purpose-built to handle the volume, variety and velocity challenges today’s data presents. The enhancements we make today will help empower more efficient analysis and better decisions in the future.”

Riak® TS 1.4 Release Highlights

The new release of Riak TS adds significant new features, including extended support for SQL commands that follow the SQL standard and the ability to expire unneeded data. Additional support for SQL commands means application developers don’t have to write code to perform common functions. The following new features are supported in Riak TS 1.4:

  • Data Expiry: Data expiry automatically removes aged data from the database. Time Series data accumulates over time and often there is no need to retain this data in the application database. Users can now configure global object expiration.  
  • Group By: The “Group By” command is used in SELECT statements and groups the results of the query by one or more columns. An example where this can be useful is when aggregating an attribute of a device over time.  
  • Show Tables: The command returns a list of all the tables in the schema.
  • Textual date and time formats: These formats are supported for INSERT and SELECT statements. In addition to timestamps, ISO 8601 date/time formats can be used when inserting data and running range queries (see the example below).
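
As a rough illustration only, reusing the GeoCheckin table defined in the Docker post earlier in this feed (exact literal formats and function support may vary by release), the new commands read like familiar SQL:

-- Hypothetical query: average temperature per region for one device over one day,
-- using an ISO 8601 style time range and the new GROUP BY support.
SELECT region, AVG(temperature)
FROM GeoCheckin
WHERE id = 1
  AND time >= '2016-08-01 00:00:00'
  AND time <  '2016-08-02 00:00:00'
GROUP BY region;

-- List all tables in the schema.
SHOW TABLES;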

Riak® KV 2.2 Release Highlights

Riak continues to support and develop Riak KV and in the coming weeks will release Riak KV 2.2,  which includes features and enhancements identified via community and customer feedback. These enhancements include:

  • Improved Active Anti-Entropy performance by enabling more consistent hashing of metadata.
  • Support for HyperLogLog Data Type: Added HyperLogLog to the CRDT based built in Riak Distributed Data Types providing developers the ability to estimate the cardinality of an input set.
  • Enhanced Configuration Controls: Enable simplified operations by adding the ability to have fine-grained control over commands that, when run, could have a performance impact on the Riak cluster.
  • Improved Search: Enhanced support for Apache Solr.
  • LZ4 compression: LZ4 compression provides more effective and faster compression of data for enhanced cluster performance.
  • Introduction of support for Debian 8 “Jessie”, and
  • The ability to expire data globally in LevelDB.

Details on Riak’s Apache Spark Connector  are available in a recent blog post. Release notes are also available. The enhanced Riak Mesos Framework provides the ability to deploy and run Riak KV and Riak TS on Apache Mesos or Mesosphere DC/OS clusters. Release notes are available.

Industry Support for Riak Solutions

“With Mantl, Cisco continues to be a strong supporter of the open source technologies that are making it possible to scale applications to meet the enormous demands of big data. We are excited to see Riak TS continue to evolve to meet the challenges associated with leveraging time series data. Riak Technologies’ efforts to further integrate Riak with Apache Spark and Apache Mesos will ease the development burden on those looking to build end to end scalable IoT and big data applications.”

— Kenneth Owens, CTO, cloud platforms and services group, Cisco Systems

Supporting Resources

About Riak Technologies

Riak Technologies, the creator of the world’s most resilient databases, is dedicated to developing disruptive technology that simplifies enterprises’ most critical distributed systems data management challenges. Riak has attracted one of the most talented groups of engineers and technical experts ever assembled devoted exclusively to solving some of the most complex issues presented by Big Data and IoT. Riak’s distributed database, Riak® KV, the industry leading distributed NoSQL database, is used by fast growing Web businesses and by one-third of the Fortune 50 to power their critical Web, mobile and social applications. Built on the same foundation, Riak introduced Riak® TS, which is the first enterprise-ready NoSQL database specifically optimized to store, query and analyze time series data. Riak also helps enterprises reduce the complexity of supporting Big Data applications by integrating Riak KV, Riak TS and Riak® S2 with Apache Spark, Mesos, Redis, and Apache Solr.

Riak is the registered trademark of Riak Technologies, Inc.  The trademarks and names of other companies and products mentioned herein are the property of their respective owners.

 

Microservices – Please, don’t

This blog post is adapted from a lightning talk I gave at the Boston Golang meetup in December of 2015.

For a while, it seemed like everyone was crazy for microservices. You couldn’t open up your favorite news aggregator of choice without some company you had never heard of touting how the move to microservices had saved their engineering organization. You may have even worked for one of those companies that got swept up in all the hype around these tiny, magical little services and how they were going to solve all of the problems in your big, ailing, legacy codebase.

Of course, in hindsight, nothing could have been further from the truth. The beauty of hindsight is that it’s often much closer to the 20/20 vision we thought we had looking forward, all those months ago.

I’m going to cover a few of the major fallacies and “gotchas” of the Microservices movement, coming from someone who worked at a company that also got swept up in the idea that breaking apart a legacy monolithic application was going to save the day. While I don’t want the takeaway of this blog post to be “Microservices == Bad”, ideally anyone reading this should walk away with a series of issues to think about when deciding if the move to a Microservice based architecture is right for them.

What is a “Micro Service” anyways?

There really is no perfect definition of what does and does not constitute a microservice, although a few people who really champion the approach have codified it to a fairly reasonable set of requirements.

Tautologically, it is not a monolith. What this actually means in practice is that a microservice only deals with as limited an area of the domain as possible, so that it does as few things as necessary to serve its defined purpose in your stack. To give you a more concrete example, if you were a bank with a “Login Service”, the last thing you’d want it to do is have access to the records of your users’ financial transactions. You’d push that out to a “Transaction Service” of some kind (keep in mind, naming things is very hard).

Additionally, when people talk about microservices they often are implicitly talking about services that need to speak to others remotely. Since they’re distinct processes, and quite often running in locations that are remote from each other, it’s common to build these services so they speak over the network using REST, or some kind of RPC protocol.

At the outset, this actually seems pretty simple – we’ll just wrap tiny pieces of the domain in a REST API of some kind, and we’ll just have everyone talk to each other over the network. In my experience, there are 5 “truths” that people believe about this approach which are not always true:

  1. It keeps the code cleaner
  2. It’s easy to write things that only have one purpose
  3. They’re faster than monoliths
  4. It’s easy for engineers to not all work in the same codebase
  5. It’s the simplest way to handle autoscaling, plus Docker is in here somewhere

Fallacy #1: Cleaner Code

“You don’t need to introduce a network boundary as an excuse to write better code”

The simple fact of the matter is that neither microservices nor any other approach for modeling a technical stack is a requirement for writing cleaner or more maintainable code. It is true that since there are fewer pieces involved, your ability to write lazy or poorly thought out code decreases; however, this is like saying you can solve crime by removing desirable items from store fronts. You haven’t fixed the problem, you’ve simply removed many of your options.

A popular approach is to architect the internals of your code around logical “services” that own a piece of the domain. This mirrors the concepts of a microservice in that it helps you to keep the dependencies needed for managing the domain explicit, as well as helps you to keep your key business logic from sprawling into multiple places. Additionally, using these services no longer incurs excess use of the network, nor potential error cases that arise from it.

A further benefit of this approach, given that it very closely mirrors a Service Oriented Architecture built around microservices, is that once you decide you should move to a microservice approach, you’ve already done a good deal of the design work up front, and likely understand your domain well enough to be able to extract it. A solid SOA approach begins in the code itself and moves out into the physical topology of the stack as time moves on.

Fallacy #2: It’s Easier

“Distributed Transactions are never easy”

While it might seem simple at the outset, most domains (especially in newer companies which need to prototype, pivot, and generally re-define the domain itself many times) do not lend themselves to being neatly carved into little boxes. Often times, a given piece of the domain needs to reach out and get data about other parts to do its job correctly. This becomes even more complex when it needs to delegate the responsibility of writing data outside of its own domain. Once you’ve broken out of your own area of influence, and need to involve others in the request flow to store and modify data, you’re in the land of Distributed Transactions (sometimes known as Sagas).

There is a lot of complexity wrapped in the problem of involving multiple remote services in a given request. Can you call them in parallel, or must they be done serially? Are you aware of all of the possible errors (both application and network level) that could arise at any point in the chain, and what that means for the request itself? Often, each of these distributed transactions needs its own approach for handling the failures that could arise, which can be a lot of work not only to understand the errors, but to determine how to handle and recover for each of them.

Fallacy #3: It’s Faster

“You could gain a lot of performance in a monolith by simply applying a little extra discipline”

This is a tough one to dispel because in truth you often can make individual systems faster by paring down the number of things they do, or the number of dependencies they load up, etc etc.

But ultimately, this is a very anecdotal claim. While I have no doubt folks who pivoted to microservices saw individual code paths isolated inside of those services speed up, understand that you’re also now adding the network in-between many of your calls. The network is never as fast as co-resident code calls, although often times it can be “fast enough”.

Additionally, many of these stories about performance gains are actually touting the benefits of a new language or technology stack entirely, and not just the concept of building out code to live in a microservice. Rewriting an old Ruby on Rails, or Django, or NodeJS app into a language like Scala or Go (two popular choices for a microservice architecture) is going to have a lot of performance improvements inherent to the choice of technology itself. But these languages don’t really “care” that you chose to describe the process they run in as “micro”, they simply perform better due to things like compilation.

Further, for a majority of apps in the startup space that are just starting out, raw CPU or Memory performance is almost never your problem. It’s I/O – and additional network calls only add further I/O to your profile.

Fallacy #4: Simple for Engineers

“A bunch of engineers working in isolated codebases leads to ‘Not my problem’ syndrome”

While on the tin it might seem simpler to have a smaller team focused on one small piece of the puzzle, ultimately this can often lead to many other problems that dwarf the gains you might see from a having a smaller problem space to tackle.

The biggest is simply that to do anything, you have to run an ever-increasing number of services to make even the smallest of changes. This means you have to invest time and effort into building and maintaining a simple way for engineers to run everything locally. Things like Docker can make this easier, but someone still needs to maintain these as things change.

Additionally, it also makes writing tests more difficult, as to write a proper set of integration tests means understanding all of the different services a given interaction might invoke, capturing all of the possible error cases, etc etc. There is even more time spent on simply understanding the system, which could better be spent continuing to develop it. While I would never tell any engineer that time spent understanding a system is time wasted, I would definitely warn people away from prematurely adding these levels of complexity until they know they need it.

Finally, it also creates social problems as well. Bugs that span multiple services and require many changes can languish as multiple teams need to coordinate and synchronize their efforts on fixing things. It can also breed a situation where people don’t feel responsible, and will push as many of the issues onto other teams as possible. When engineers work together in the same codebase, their knowledge of each other and the system itself grows in kind. They’re more willing and capable when working together to tackle problems, as opposed to being the kings and queens of isolated little fiefdoms.

Fallacy #5: Better for Scalability

“You can scale a microservice outward just as easily as you can scale a monolith”

It’s not incorrect to say that packaging your services as discrete units which you then scale via something like Docker is a good approach for horizontal scalability.

However, it’s incorrect to say that you can only do this with something like a microservice. Monolithic applications work with this approach as well. You can create logical clusters of your monolith which only handle a certain subset of your traffic. For example, inbound API requests, your dashboard front end, and your background jobs servers might all share the same codebase, but you don’t need to handle all 3 subsets of work on every box.

The benefit here, like it exists in a microservice approach, is that you can tune individual clusters to their given workload, as well as scale them individually in response to a surge in traffic to a given workload. So while a microservice approach guides you into this approach from the get go, you can apply the exact same method of scaling your stack to a more monolithic process as well.

When should you use microservices?

“When you’re ready as an engineering organization”

I’d like to close by going over when it could be the right time to pivot to this approach (or, if you’re starting out, how to know if this is the right way to start).

The single most important step on the path to a solid, workable approach to microservices is simply understanding the domain you’re working in. If you can’t understand it, or if you’re still trying to figure it out, microservices could do more harm than good. But if you have a deep understanding, then you know where the boundaries are, what the dependencies are, so a microservices approach could be the right move.

Another important thing to have a handle on is your workflows – specifically how they might relate to the idea of a Distributed Transaction. If you know the paths each category of request will make through your system, and you understand where, how, and why each of those paths might fail, you could start to build out a distributed model of handling your requests.

Alongside understanding your workflows is monitoring your workflows. Monitoring is a subject greater than just “Microservice VS Monolith”, but it should be something at the core of your engineering efforts. You may need a lot of data at your fingertips about various parts of your systems to understand why one of them is underperforming, or even throwing errors. If you have a solid approach for monitoring the various pieces of your system, you can begin to understand your system’s behavior as you increase its footprint horizontally.

And finally, when you can actually demonstrate value to your engineering organization, and the business as well, that moving to microservices will help you grow, scale, and make money. Although it’s fun to build things and try new ideas out, at the end of the day the most important thing for many companies is their bottom line. If you have to delay putting out a new feature that will make the company revenue because a blogpost told you monoliths were “doing it wrong”, you’re going to need to justify that to the business. Sometimes these tradeoffs are worth it. Sometimes they aren’t. Knowing how to pick your battles and spend time on the right technical debt will earn you a lot of credit in the long run.

Takeaways

Hopefully, you have a new series of conditions and questions to go over the next time someone is suggesting a microservices approach. As I opened with, my aim was not to tell you that microservices are bad; rather, that jumping into them without thinking through all of the concerns is a recipe for problems down the road.

If you were to ask me, I’d advocate for building “Internal” services via cleanly defined modules in code, and carve them out into their own distinct services if a true need arises over time. This approach isn’t necessarily the only way to do it, and it also isn’t a panacea against bad code on its own. But it will get you further, faster, than trying to deal with a handful or more microservices before you’re ready to do so.

 

Sean Kelly (affectionately known as “Stabby”) has been a software engineer for over 12 years. He is an inaugural member of Riak’s “Taishi” developer program, and currently works for Komand, a security orchestration and automation platform that empowers security teams to quickly automate and streamline security operations, with no need for code. Komand enables teams to connect their tools, build dynamic workflows, and utilize human insight to accelerate incident response and move forward, faster. You can reach him on twitter @StabbyCutyou.

Enterprise IoT vs. Consumer IoT

Riak’s Spark-Riak Connector 1.6.0 Now Available

Riak’s Engineering team is proud to announce that version 1.6.0 of the Spark-Riak connector has been released. Over the past several months we have added new features, upgraded existing features, and fixed bugs to enable our customers to take full advantage of the combined power of Riak and Apache Spark.

We’ve listened to our Community and included several key features:

  • Python support for Riak KV buckets
  • Support for Spark Streaming
  • Failover support for when a Riak node is unavailable during a Spark job

We also added performance and testing enhancements, upgraded example applications and added documentation. A full list of changes can be found in the CHANGELOG. All new features will work with TS tables and KV buckets in Riak TS 1.3.1+, and support for Riak KV will come with the release of Riak KV 2.3.[updated October 19, 2016]

Let’s take a peek at one of the more exciting features of this release: Python support for Riak KV buckets. For the purposes of this quick demonstration, [we will assume there is a Riak TS 1.4 node at 127.0.0.1:8087] and that the code will be run in a Jupyter notebook.

Before we can use Spark, we must set up a Spark context using the Spark-Riak connector:

import findspark
findspark.init()
import pyspark

import pyspark_riak
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages com.basho.riak:spark-riak-connector:1.6.0 pyspark-shell"
conf = pyspark.SparkConf().setAppName("My Spark Riak App")
conf.set("spark.riak.connection.host", "127.0.0.1:8087")
sc = pyspark.SparkContext(conf=conf)

pyspark_riak.riak_context(sc)

Now that the Spark context has been properly initialized for use with Riak, let’s write some data to a KV bucket named ‘kv_sample_bucket’:

sample_data = [ {'key0': {'data': 0}}, {'key1': {'data': 1}}, {'key2': {'data': 2}} ]

kv_write_rdd = sc.parallelize(sample_data)

kv_write_rdd.saveToRiak('kv_sample_bucket')

Now that Spark has written our data in parallel to a KV bucket, let’s pull that data out with a full bucket read:

kv_read_rdd = sc.riakBucket('kv_sample_bucket').queryAll()

print(kv_read_rdd.collect()) # prints  [ {'key0': {'data': 0}}, {'key1': {'data': 1}}, {'key2': {'data': 2}} ]

There are several other ways to read data from a KV bucket including simple key queries, 2i tag queries, and 2i range queries. Additionally, with 2i range queries, custom partitioning and parallelization can be used to increase read performance.  More information on python support for the Spark-Riak connector can be found in the docs.
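
To give a flavor of those variants, here is a rough sketch using the method names from the connector’s documentation (verify them against the version you install); the bucket is the one written above, while the 2i index name is purely hypothetical and assumes the objects were stored with such an index:

# Fetch specific keys from the bucket written above.
keys_rdd = sc.riakBucket('kv_sample_bucket').queryBucketKeys('key0', 'key2')

# 2i tag query and 2i range query against a hypothetical integer index 'creation_no'.
tag_rdd = sc.riakBucket('kv_sample_bucket').query2iKeys('creation_no', 1, 2)
range_rdd = sc.riakBucket('kv_sample_bucket').query2iRange('creation_no', 0, 100)

print(keys_rdd.collect())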

To get started using the new version of the Spark-Riak connector, we encourage you to visit the github repository and start playing around with all the new features.
Korrigan Clark
@kralCnagirroK

Don’t Fall For ‘IoT-Washing’ Tactics; They Won’t Get You (Or Your Data) Far by Adam Wray

NoSQL Riak TS Gets JDBC Driver Inspired by SQL

When Riak’s engineering team released Riak TS 1.0 back in December 2015, one of the features that I found most exciting was its use of standard SQL. I know that there aren’t a lot of people who get excited by SQL in this era of NoSQL databases, but SQL isn’t dead just yet. In the 30+ years that SQL has been in use, it has had the opportunity to find itself integrated into the vast majority of databases and reporting tools used by enterprises. Essentially, SQL has become the lingua franca of data analysis, and by making SQL the query language of Riak TS, Riak made the database accessible to a wider range of potential users.

As cool as that is, as a developer, I realized that the use of SQL also made it possible to build a JDBC (Java Database Connectivity) driver for Riak TS. If you aren’t already familiar with the JDBC API, it provides Java applications standardized methods to connect to, query, and update data in any database (almost exclusively relational databases) that provides a JDBC driver. As an official part of the Java language since 1997, JDBC has been widely adopted by developers. If you use a reporting tool like those available from Cognos, Microstrategy, Business Objects, or Jaspersoft, then you can connect to any data source that provides a JDBC driver.

Once I realized how important a JDBC driver would be for Riak TS, I was compelled to write one. When I started down the path of writing a JDBC driver for Riak TS, my goal was simply to use it as a learning opportunity; I wasn’t really convinced that I would have the time or ability to produce something that would be generally useful. As I started working on the driver, the learning exercise became a viable project, and so now I’ve decided to open source the project and share my work with the community:

https://github.com/basho-labs/Riak-TS-JDBC-Driver

There are two main reasons why you would want to use the JDBC Driver:

  1. You are a Java application developer familiar with the JDBC API and want to integrate Riak TS into an application;
  2. You use reporting tools like BusinessObjects, Cognos, or Jaspersoft that allow you to connect to databases using JDBC drivers.

If you have one or more of the preceding uses for a JDBC driver for Riak TS, check out the ReadMe at https://github.com/basho-labs/Riak-TS-JDBC-Driver/tree/master/riakts.jdbc.driver for full details on the driver’s capabilities and how to get started using it. And if you do use the driver, please leave feedback, submit issues, or submit pull requests.

You can learn more about Riak TS at these links:

Please reach out via twitter with any feedback.

Craig Vitter

@craigvitter

Why Data Lakes Are Evil

Riak CTO: It’s high IoT time for time series data