March 17, 2015
As a Solutions Architect for Basho, I’m often called upon by customers to explain Riak. Frequently this is in the context of a specific use-case they are deploying, or a problem they are attempting to solve. In each of these cases, an important component of the conversation is ensuring a base level of understanding of the design principles in Riak’s distributed architecture. It is this architecture, and these early design decisions that ensure that Riak “just works” and why companies choose Riak for their business critical applications.
This blog highlights key reasons why Riak “just works” from a high availability, fault tolerant, data distribution, predictable performance and operational simplicity perspective. It also shows how Riak addresses data replication, detection of node failures, read latency, and rolling restarts due to upgrades, failures or operating system related issues.
Uniform Data Distribution and Predictable Performance
If you have ever seen a presentation about Riak, you have seen an inclusion of the “ring diagram.” The image is fairly simple and one that we use to describe data distribution, scalability, and performance characteristics of Riak. But the underlying architectural implementation that this ring represents is an important characteristic of Riak’s architecture.
Riak employs a SHA1 based consistent hashing algorithm that is mathematically proven to produce a perfectly uniform distribution about a 160-bit space (the ring). This 160-bit space is divided into equal partitions called “vnodes.” These vnodes, in turn, are evenly distributed amongst participating physical nodes in the cluster. Uniform distribution about the 160-bit space and an even allocation of vnodes amongst physical nodes ensures keys are uniformly distributed about the cluster. Participating nodes in a Riak cluster are homogeneous – meaning any node can service any request – and due to the nature of consistent hashing, every node in the cluster knows where data should reside within the cluster.
Riak’s default behavior is to seek equilibrium. Each node in a cluster is responsible for 1/nth data within the cluster and 1/nth total performance, where n represents total node count in a cluster. This architectural principle allows operators to make reasonable assumptions about Riak’s linear scalability. Often in sharded systems, access patterns that disproportionately access specific ranges of data will cause “hot spots” in a cluster making predictable operations more difficult to maintain. Uniform data distribution about the cluster (and Hinted Handoff described below) allows for continuous normal operations of an active Riak cluster while maintaining a predictable level of performance as any node is removed from the cluster for any reason. Even in resource-starved virtual environments, a Riak cluster will work to maintain its equilibrium and equal data distribution such that operators may assemble larger numbers of slower individual virtual machines to achieve their desired performance profile in aggregate. Additionally, each physical node in a Riak cluster maintains its own performance statistics that are easily accessible and parsable such that an advanced deployment would be able to wrap those statistics into a sick-node detection algorithm based on their own specific thresholds.
Furthering the predictable performance rationale, Riak uses vector clocks, specifically dotted version vectors in Riak 2.0, to internally reason about conflict resolution as it relates to multiple updates to the same object. Updates to any object are independent of updates to any other object and thus there is no locking in Riak for any of its core operations – reads, writes and deletes. There are no global or local locks on any tables or rows – those structures do not exist in Riak. Locks introduce nondeterministic delays in performance yet are often a necessary component of any absolutely consistent database. Riak on the other hand is an eventually consistent database (caveat Riak 2.0’s strong consistency feature at the expense of availability). Additionally, because the base level of abstraction in Riak is the vnode, and data is replicated amongst a set of vnodes operating on distinct physical hardware, background processes such as file compaction is scheduled in such a way that it only affects a segment of the cluster at any given time. This rolling compaction ensures a high degree of availability with minimal performance penalty.
High Availability and Failure Recovery
The measure of a distributed system is not how well that system runs under optimal conditions in the general case, but rather how that system performs in the face of failure. An architect must ask herself how well her system will perform in the face of node failure and how well that system will recover from failure. Failures happen with increasing frequency as the size of your system grows. Riak implements a number of technologies that when combined ensure Riak excels at failure recovery. These technologies provide a baseline set of features that allow Riak to quickly recover from failure scenarios with minimal operator intervention. Not only will these built-in recovery mechanisms maintain eventual consistency within the database but they will also maintain synchronicity with features such as Solr’s full text indexing and Multi Datacenter Replication.
Hinted Handoff ensures data is replicated an appropriate number of times in spite of failure by allowing a node to take over responsibility for a vnode and then return that data to its original “owner.” Handoff from one vnode to another can happen on a temporary (in the case of failure) or permanent (in the case of cluster resizing) basis. In both cases, Riak handles this automatically while the cluster remains available. Because Riak is able to dynamically allocate vnode assignments continuously, the cluster can absorb the loss of any physical node(s) for write operations and ensure availability of data as long as one vnode of any replica set is still accessible. This allows Riak to maintain availability when a node is removed from the cluster for any reason – whether scheduled or not. Ultimately, an unscheduled failure or a scheduled upgrade of Riak or the operating system results in the removal of a node from the cluster – Riak’s core architecture and capabilities accommodate this behavior with minimal operator intervention.
Read Repair is a mechanism triggered on a successful read of a value where all replicas may not agree on the value. There are two possibilities where this can happen: when one replica returns not found – meaning it doesn’t have a copy, and when one replica returns a value where the vector clocks is an ancestor of the successful read. When this situation occurs, Riak will force the errant nodes to update the object’s value based on the value of the successful read.
Finally, Active Anti Entropy corrects inconsistent data continuously in the background. Where Read Repair corrects data at read time, AAE corrects all data regardless of whether or not it is actively accessed by running a background process that continuously checks for inconsistencies. AAE uses Merkle Tree hash exchanges between vnodes to look for these inconsistencies. When a difference at the top of the tree is detected, Riak recursively checks the tree until it finds the exact values with a difference between vnodes and then sends the smallest amount of data necessary to regain equilibrium.
Architecture Makes a Difference
Riak’s core architecture and key technical features provide the building blocks valued in a highly distributed environment. Uniform data distribution, homogeneous nodes, hinted handoff, read repair and active anti entropy all play a role in providing the high availability, fault tolerance, predictable performance and operational simplicity developers and managers are looking for from their non-relational persisted data solutions.
We’ve seen Riak adopted across a wide variety of verticals and for a broad range of use cases. From gaming to retail to advertising to mobile, our customers begin by identifying a workload, or use case, where availability, scalability, and fault tolerance are critical. We then work closely with these customers to ensure Riak is an ideal fit for the architectural design and business requirements. This process often begins with a Tech Talk. Someone like me, working with you either onsite or remotely, to assess how Riak can help you solve your critical business requirements. You can sign up for a Tech Talk here.
July 2, 2013
We use the Erlang/OTP programming language in building our products here at Basho. We made that choice consciously, believing that it would be a tradeoff – significant benefits balanced by a handful of costs. I am often asked if we would make the same choice all over again. To answer that question I need to address the tradeoff we thought we were making.
The single most compelling reason to choose Erlang was the attribute for which it is best known: extremely high availability. The original design goal for Erlang was to enable rapid development of highly robust concurrent systems that “run forever.” The poster child of its success (outside Riak, of course) is the AXD 301 ATM switch, which reportedly delivers at or better than “nine nines” (99.9999999%) of uptime to customers. Since when we set out to build a database for applications requiring extremely high availability, Erlang was a natural fit.
We knew that Erlang’s supervisor concept, enabling a “let it crash” program designed for resilience, would be a big help for making systems that handle unforeseen errors gracefully. We knew that lightweight processes and a “many-small-heaps” approach to garbage collection would make it easier to build systems not suffering from unpredictable pauses in production. Those features paid off exactly as expected, and helped us a great deal. Many other features that we didn’t understand the full importance of at the time (such as the ability to inspect and modify a live system at run-time with almost no planning or cost) have also helped us greatly in making systems that our users and customers trust with their most critical data.
It turns out that our assessment of the key trade-off — a more limited pool of talented engineers — is, in practice, not a problem for a company like Basho. We need to hire great software developers, and we tend to look for ones with particular skills in areas like databases and/or distributed systems. If someone is a skilled programmer in relatively arcane disciplines like those, then the ability to learn a new programming language will not be daunting. While it’s theoretically a nice bonus for someone to bring knowledge of all the tools we use, we’ve hired a significant number of engineers that had no prior Erlang experience and they’ve worked out well.
This same purported drawback is a benefit in some ways. By not just looking for “X Engineers” (where X is Java, Erlang, or anything else), we make a statement both about our own technology decision-making process and the expected levels of interesting work at Basho. To help me work on my house, I’d rather have someone who self-identifies as an “expert carpenter” or “expert plumber,” not “expert hammer wielder,” even in the cases where most of the job might involve that tool. We expect developers at Basho to exercise deep, broad interests and expertise, and for them to do highly creative work. When we mention Erlang and the other thoughtful decisions we made in building our products, they value the roadmap and leadership.
I had an entertaining and ironic conversation about this recently with a manager at a large database company. He explained to me that we had clearly made the wrong choice, and that we should have chosen Java (like his team) in order to expand the recruiting pool. Then, without breaking a stride, he asked if I could send any candidates his way, to fill his gaps in finding talented people.
We continue to grow and to bring on great new engineers.
That’s not to say that there are no downsides. Any language, runtime, and community will bring with it different constraints and freedoms, making some tasks easier and others less so. We’ve done some work over the years to participate in the highly supportive Erlang community. But the big organizational weakness that so many people thought would come with the choice? It’s simply not a problem.
That lesson, combined with the ongoing technical advantages we enjoy because of Erlang, makes it easy to answer the question:
Yes, we would absolutely choose Erlang today.
November 18, 2009
Justin spends a little time discussing Riak and then quickly moves on to a discussion of first principles.
Justin’s presentation stands on its own but it is worth pointing out: terms like “scalable” and “distributed” and “fault tolerant” are not marketing terms. Applied rigorously, the principles underlying them (a hat tip to folks like Brewer, Lewin/Leighton/Karger et. al.) lead to game-changing software.
Building truly decentralized systems requires discipline. Shortcuts for premature optimization ultimately lead to a dead end.