March 17, 2015
As a Solutions Architect for Basho, I’m often called upon by customers to explain Riak. Frequently this is in the context of a specific use case they are deploying, or a problem they are attempting to solve. In each of these cases, an important component of the conversation is ensuring a base level of understanding of the design principles in Riak’s distributed architecture. It is this architecture, and these early design decisions, that ensure Riak “just works” and that lead companies to choose Riak for their business-critical applications.
This blog post highlights key reasons why Riak “just works” in terms of high availability, fault tolerance, data distribution, predictable performance and operational simplicity. It also shows how Riak addresses data replication, detection of node failures, read latency, and rolling restarts due to upgrades, failures or operating system related issues.
Uniform Data Distribution and Predictable Performance
If you have ever seen a presentation about Riak, you have seen an inclusion of the “ring diagram.” The image is fairly simple and one that we use to describe data distribution, scalability, and performance characteristics of Riak. But the underlying architectural implementation that this ring represents is an important characteristic of Riak’s architecture.
Riak employs a SHA-1-based consistent hashing algorithm that produces an effectively uniform distribution across a 160-bit space (the ring). This 160-bit space is divided into equal partitions called “vnodes.” These vnodes, in turn, are evenly distributed amongst participating physical nodes in the cluster. Uniform distribution across the 160-bit space and an even allocation of vnodes amongst physical nodes ensure keys are uniformly distributed about the cluster. Participating nodes in a Riak cluster are homogeneous – meaning any node can service any request – and due to the nature of consistent hashing, every node in the cluster knows where data should reside within the cluster.
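The key-to-vnode-to-node mapping can be sketched in a few lines of Python. This is an illustrative simplification, not Riak’s actual Erlang implementation: the function names are my own, `NUM_PARTITIONS` mirrors Riak’s default ring size of 64, and the round-robin claim function stands in for Riak’s more sophisticated claim algorithm.

```python
import hashlib

RING_SIZE = 2 ** 160          # the 160-bit SHA-1 output space (the ring)
NUM_PARTITIONS = 64           # vnode count; 64 is Riak's default ring size

def key_to_partition(bucket: bytes, key: bytes) -> int:
    """Hash a bucket/key pair onto the 160-bit ring, return its vnode index."""
    digest = hashlib.sha1(bucket + b"/" + key).digest()
    position = int.from_bytes(digest, "big")      # a point on the ring
    partition_size = RING_SIZE // NUM_PARTITIONS  # equal slices of the ring
    return position // partition_size             # which slice owns the point

def partition_to_node(partition: int, nodes: list) -> str:
    """Assign vnodes to physical nodes round-robin (a stand-in for
    Riak's actual claim algorithm)."""
    return nodes[partition % len(nodes)]

nodes = ["node1", "node2", "node3", "node4"]
p = key_to_partition(b"users", b"alice")
print(p, partition_to_node(p, nodes))
```

Because the hash is deterministic, every node can compute this mapping locally, which is why any node can route any request without consulting a central directory.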
Riak’s default behavior is to seek equilibrium. Each node in a cluster is responsible for 1/nth of the data within the cluster and 1/nth of the total performance, where n represents the total node count in the cluster. This architectural principle allows operators to make reasonable assumptions about Riak’s linear scalability. Often in sharded systems, access patterns that disproportionately access specific ranges of data will cause “hot spots” in a cluster, making predictable performance more difficult to maintain. Uniform data distribution about the cluster (and Hinted Handoff, described below) allows for continuous normal operations of an active Riak cluster while maintaining a predictable level of performance when any node is removed from the cluster for any reason. Even in resource-starved virtual environments, a Riak cluster will work to maintain its equilibrium and equal data distribution, such that operators may assemble larger numbers of slower individual virtual machines to achieve their desired performance profile in aggregate. Additionally, each physical node in a Riak cluster maintains its own performance statistics that are easily accessible and parsable, such that an advanced deployment could wrap those statistics into a sick-node detection algorithm based on its own specific thresholds.
Furthering the predictable performance rationale, Riak uses vector clocks, specifically dotted version vectors in Riak 2.0, to reason internally about conflict resolution as it relates to multiple updates to the same object. Updates to any object are independent of updates to any other object, and thus there is no locking in Riak for any of its core operations – reads, writes and deletes. There are no global or local locks on any tables or rows – those structures do not exist in Riak. Locks introduce nondeterministic delays in performance, yet are often a necessary component of any absolutely consistent database. Riak, on the other hand, is an eventually consistent database (with the exception of Riak 2.0’s opt-in strong consistency feature, which trades availability for consistency). Additionally, because the base level of abstraction in Riak is the vnode, and data is replicated amongst a set of vnodes operating on distinct physical hardware, background processes such as file compaction are scheduled in such a way that they only affect a segment of the cluster at any given time. This rolling compaction ensures a high degree of availability with minimal performance penalty.
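The version-vector comparison underlying this conflict detection can be made concrete with a minimal Python sketch. This models plain vector clocks rather than Riak’s dotted version vectors, and the function names are illustrative, not from any Riak client library.

```python
def increment(clock: dict, actor: str) -> dict:
    """Return a new clock with `actor`'s event counter bumped by one."""
    new = dict(clock)
    new[actor] = new.get(actor, 0) + 1
    return new

def descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen every event in clock `b`,
    i.e. b is an ancestor of (or equal to) a."""
    return all(a.get(actor, 0) >= n for actor, n in b.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither clock descends the other: the writes conflict, and Riak
    would keep both values as siblings for the application to resolve."""
    return not descends(a, b) and not descends(b, a)

# A causal chain: the second write supersedes the first; no conflict.
c1 = increment({}, "client_a")
c2 = increment(c1, "client_b")
print(descends(c2, c1))            # True: c2 supersedes c1

# Two writes forked from c1 in ignorance of each other: a genuine conflict.
fork_a = increment(c1, "client_a")
fork_b = increment(c1, "client_b")
print(concurrent(fork_a, fork_b))  # True: these become siblings
```

Note that this per-object bookkeeping requires no coordination between objects, which is exactly why no table- or row-level locks are needed.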
High Availability and Failure Recovery
The measure of a distributed system is not how well that system runs under optimal conditions in the general case, but rather how that system performs in the face of failure. An architect must ask herself how well her system will perform in the face of node failure and how well that system will recover from failure. Failures happen with increasing frequency as the size of your system grows. Riak implements a number of technologies that, when combined, ensure Riak excels at failure recovery. These technologies provide a baseline set of features that allow Riak to quickly recover from failure scenarios with minimal operator intervention. Not only will these built-in recovery mechanisms maintain eventual consistency within the database, but they will also keep features such as Solr full-text indexing and Multi-Datacenter Replication in sync.
Hinted Handoff ensures data is replicated an appropriate number of times in spite of failure by allowing a node to take over responsibility for a vnode and then return that data to its original “owner.” Handoff from one vnode to another can happen on a temporary (in the case of failure) or permanent (in the case of cluster resizing) basis. In both cases, Riak handles this automatically while the cluster remains available. Because Riak is able to dynamically allocate vnode assignments continuously, the cluster can absorb the loss of any physical node(s) for write operations and ensure availability of data as long as one vnode of any replica set is still accessible. This allows Riak to maintain availability when a node is removed from the cluster for any reason – whether scheduled or not. Ultimately, an unscheduled failure or a scheduled upgrade of Riak or the operating system results in the removal of a node from the cluster – Riak’s core architecture and capabilities accommodate this behavior with minimal operator intervention.
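The fallback selection that makes hinted handoff possible can be sketched as a walk around the ring. This is a simplification of Riak’s actual preference-list logic; the function name, the tuple layout, and the ring representation are all illustrative.

```python
def preference_list(partition, ring, n_val, up_nodes):
    """Pick the vnodes that will hold a key's n_val replicas.

    `ring` maps partition index -> owning node. Primaries are the n_val
    positions starting at the key's partition; when a primary's node is
    down, we keep walking the ring clockwise until we find a live node,
    and the recorded hint names the vnode the data should hand back to
    once its owner returns. (A simplification of Riak's real algorithm.)
    """
    assert any(node in up_nodes for node in ring), "no live nodes"
    result = []
    cursor = (partition + n_val) % len(ring)
    for i in range(n_val):
        idx = (partition + i) % len(ring)
        if ring[idx] in up_nodes:
            result.append((idx, ring[idx], None))       # primary, no hint
        else:
            while ring[cursor] not in up_nodes:         # find a fallback
                cursor = (cursor + 1) % len(ring)
            result.append((cursor, ring[cursor], idx))  # hint = real owner
            cursor = (cursor + 1) % len(ring)
    return result

ring = ["n1", "n2", "n3", "n4"] * 2   # 8 partitions over 4 nodes
# With n3 down, its vnode is covered by a fallback on n4 carrying a hint.
print(preference_list(0, ring, 3, {"n1", "n2", "n4"}))
```

Because a fallback is always found as long as any replica position is reachable, writes succeed even with nodes missing, and the hint drives the later handoff back to the returning owner.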
Read Repair is a mechanism triggered on a successful read when the replicas do not all agree on the value. This can happen in two cases: when a replica returns not found, meaning it does not have a copy of the object, and when a replica returns a value whose vector clock is an ancestor of the vector clock of the successful read. When this situation occurs, Riak forces the errant replicas to update the object’s value based on the value of the successful read.
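The two repair cases can be sketched as follows. For simplicity this sketch assumes one reply dominates all the others (truly concurrent clocks would surface as siblings instead), and all names are illustrative rather than Riak API calls.

```python
def descends(a: dict, b: dict) -> bool:
    """True if vector clock `a` has seen every event in clock `b`."""
    return all(a.get(k, 0) >= n for k, n in b.items())

def read_repair(replies):
    """replies maps replica name -> (value, vector clock), or None for a
    'not found'. Returns the winning value and the replicas to repair."""
    found = {r: vv for r, vv in replies.items() if vv is not None}
    # A clock that descends all others also has the largest event total,
    # so under the no-conflict assumption max-by-sum picks the winner.
    _, (value, clock) = max(found.items(),
                            key=lambda kv: sum(kv[1][1].values()))
    stale = {r for r, vv in replies.items()
             if vv is None or (vv[1] != clock and descends(clock, vv[1]))}
    return value, stale

replies = {
    "vnode1": ("basho.com", {"client": 2}),   # the latest write
    "vnode2": ("basho.org", {"client": 1}),   # a stale ancestor
    "vnode3": None,                           # not found: missing copy
}
# vnode2 (ancestor) and vnode3 (not found) both get repaired to "basho.com".
value, to_repair = read_repair(replies)
```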
Finally, Active Anti Entropy corrects inconsistent data continuously in the background. Where Read Repair corrects data at read time, AAE corrects all data regardless of whether or not it is actively accessed by running a background process that continuously checks for inconsistencies. AAE uses Merkle Tree hash exchanges between vnodes to look for these inconsistencies. When a difference at the top of the tree is detected, Riak recursively checks the tree until it finds the exact values with a difference between vnodes and then sends the smallest amount of data necessary to regain equilibrium.
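The hash-exchange idea behind AAE can be sketched with a toy Merkle tree. This assumes a power-of-two number of leaves and in-memory trees; Riak’s AAE trees are persistent and vastly larger, and the function names here are my own.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build_tree(leaves):
    """Build a binary Merkle tree bottom-up from a power-of-two list of
    leaf hashes; levels[0] is the leaf level, levels[-1] holds the root."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        lvl = levels[-1]
        levels.append([h(lvl[i] + lvl[i + 1]) for i in range(0, len(lvl), 2)])
    return levels

def differing_leaves(ta, tb, level=-1, idx=0):
    """Compare two trees top-down, descending only into subtrees whose
    hashes disagree; returns the indices of mismatched leaves."""
    if ta[level][idx] == tb[level][idx]:
        return []                                # whole subtree is in sync
    if level == -len(ta):                        # reached the leaf level
        return [idx]
    return (differing_leaves(ta, tb, level - 1, 2 * idx) +
            differing_leaves(ta, tb, level - 1, 2 * idx + 1))

keys_a = [b"k0=v0", b"k1=v1", b"k2=v2", b"k3=v3"]
keys_b = [b"k0=v0", b"k1=v1", b"k2=STALE", b"k3=v3"]
ta = build_tree([h(k) for k in keys_a])
tb = build_tree([h(k) for k in keys_b])
print(differing_leaves(ta, tb))   # [2] -> only that key range is exchanged
```

If the roots match, a single hash comparison confirms two vnodes are in sync; otherwise only the divergent branches are walked, which is why AAE exchanges the smallest amount of data necessary.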
Architecture Makes a Difference
Riak’s core architecture and key technical features provide the building blocks valued in a highly distributed environment. Uniform data distribution, homogeneous nodes, hinted handoff, read repair and active anti entropy all play a role in providing the high availability, fault tolerance, predictable performance and operational simplicity developers and managers are looking for from their non-relational persisted data solutions.
We’ve seen Riak adopted across a wide variety of verticals and for a broad range of use cases. From gaming to retail to advertising to mobile, our customers begin by identifying a workload, or use case, where availability, scalability, and fault tolerance are critical. We then work closely with these customers to ensure Riak is an ideal fit for the architectural design and business requirements. This process often begins with a Tech Talk. Someone like me, working with you either onsite or remotely, to assess how Riak can help you solve your critical business requirements. You can sign up for a Tech Talk here.
February 17, 2015
According to TechTarget, a common definition of “High Availability” is:
“In information technology, high availability refers to a system or component that is continuously operational for a desirably long length of time. Availability can be measured relative to “100% operational” or “never failing.”
The reality is that this phrase has become semantically overloaded by its inclusion in marketing copy across a disparate set of technologies. Much like “Big Data”, perspectives on availability vary based on industry and customer expectation.
For many of today’s applications and platforms, high availability has a direct impact on revenue. A few examples include: cloud services, online retail, shopping carts, gaming and betting, and advertising. Further, lack of availability can damage user trust and result in a poor user experience for many social media and chat applications, websites, and mobile applications. Riak provides the high availability needed for your critical applications.
Availability – By the Numbers
As we highlighted in an infographic entitled Down with Downtime, more than 95% of businesses with 1,000+ employees estimate that they lose more than $100,000 for every hour of downtime. For more than 1 in 2 large businesses, the cost of downtime amounts to more than $300,000 per hour, which works out to roughly $83 per second. At the upper end of the spectrum (in financial services) it can amount to $1,800 for every second of downtime.
This fiscal impact has resulted in availability being measured as a percentage calculation of uptime in a given year. This percentage is often referred to as the “number of 9s” of availability. For example, “one nine” of availability equates to 90% uptime in a year. Similarly, “five nines” (the standard that was set by consulting firms on enterprise projects) equates to 99.999% availability in a year. While that percentage is often referenced, the practical reality is that it means there can be no more than 6.05 seconds of unplanned downtime per week.
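The arithmetic behind the “nines” is easy to verify with a few lines of Python (a simple calculator, using a 365-day year):

```python
SECONDS_PER_WEEK = 7 * 24 * 3600      # 604,800
SECONDS_PER_YEAR = 365 * 24 * 3600    # 31,536,000

def downtime_budget(nines: int):
    """Allowed downtime (seconds per year, seconds per week) for N nines."""
    unavailability = 10 ** -nines      # e.g. five nines -> 99.999% -> 1e-5
    return (SECONDS_PER_YEAR * unavailability,
            SECONDS_PER_WEEK * unavailability)

for n in (1, 3, 5):
    per_year, per_week = downtime_budget(n)
    print(f"{n} nines: {per_year:,.2f} s/year, {per_week:.3f} s/week")
```

For five nines this yields 315.36 seconds per year, or the 6.05 seconds of unplanned downtime per week cited above.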
Availability – A Feature or A Benefit?
Often, when describing Riak, I begin by explaining the benefits of Riak (availability, scalability, fault tolerance, operational simplicity) and then discuss, in detail, the properties that these benefits are derived from. Availability is not something that can be added to a system (be it a distributed database or otherwise), rather it is an outcome of the core architectural decisions that were made in the development of the product.
Consider, for example, the AXD 301 ATM switch, which reportedly delivers at or better than “nine nines” (99.9999999%) of availability to customers. This is a staggering number that permits no more than about 0.6 milliseconds of downtime per week, or roughly 31.5 milliseconds per year. Interestingly, it shares a common architectural foundation with Riak: both are developed in Erlang.
So the question becomes, “How does Riak achieve high availability?” Or, perhaps better stated, “What are the architectural decisions made in Riak that enable high availability?”
Availability – An Architectural Decision
Riak is a masterless system designed for high availability, even in the event of hardware failures or network partitions. Any server (termed a “node” in Riak) can serve any incoming request and all data is replicated across multiple nodes. If a node experiences an outage, other nodes will continue to service read and write requests. Further, if a node becomes unavailable to the rest of the cluster, a neighboring node will take over the responsibilities of the missing node. The neighboring node will pass new or updated data (termed “objects”) back to the original node once it rejoins the cluster. This process is called “hinted handoff” and it ensures that read and write availability is maintained automatically, minimizing your operational burden when nodes fail or come back online.
More information about the architectural decisions involved in Riak’s design is available in our documentation. In particular, the Concepts – Clusters section is deeply illustrative.
Availability – A Use Case
Consider, for example, the implementation of Riak at Temetra. Temetra has thousands of users and millions of meters that create billions of data points. The massive influx of data being generated quickly became difficult to manage with the company’s legacy SQL database. When considering how this database could be overhauled, Temetra conducted evaluations with Cassandra and Hadoop but ultimately chose Riak due to its high availability and its relatively self-maintaining, easy-to-deploy infrastructure. It is essential that the data collected from the meters is always available, as it is relied on to determine correct billing for Temetra’s customers.
Availability – A Summary
The reality is that a database, even a distributed, masterless, multi-model platform like Riak, is only one component of the application stack. Understanding your availability requirements requires deep knowledge of the entirety of the deployment environment. “High Availability” cannot be retrofitted into a system. Rather, it requires conscious effort in the early stages to ensure that customer requirements are met and that downtime does not result in lost customers and lost revenue.
Thousands have watched and enjoyed Peter Alvaro’s engaging and informative RICON 2014 Keynote presentation. Alvaro is a PhD candidate at the University of California Berkeley. His research interests lie at the intersection of databases, distributed systems, and programming languages. Alvaro’s style of delivery blends humor with deep technical detail and is especially informative for those interested in distributed systems.
In his presentation, Alvaro discusses 4 key ideas:
- Mourning the death of transactions
- What is so hard about distributed systems?
- Distributed consistency: managing asynchrony
- Fault-tolerance: progress despite failures
Alvaro starts his presentation by introducing us to Jim Gray and transactional systems. Many of you may know Gray’s work, and, sadly, that he was lost at sea in January 2007. His spirit and legacy are missed.
Alvaro provides insights into transactional systems and the top-down approach these systems traditionally used. He also points out that Eric Brewer, in his RICON 2012 keynote address, suggested that a bottom-up approach might be needed for today’s distributed systems.
Alvaro dives into why anyone would implement distributed systems and why developing distributed systems is hard, really hard. In a distributed system, it is necessary to manage two fundamental uncertainties or failure modes — asynchrony and partial failure. Alvaro uses a humorous metaphor of two clowns to demonstrate how, in the real world, asynchrony and partial failure can’t be dealt with separately, but must be looked at together.
From his humorous metaphor come some definitions:
Distributed consistency = managing asynchrony
Fault-tolerance = progress despite failures
Alvaro then provides details on distributed consistency: when data is distributed, how is consistency handled? He starts with object-level consistency, introducing and defining CRDTs and showing how these replicated data types help solve the distributed consistency challenge at the object level.
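For readers new to CRDTs, the grow-only counter is the canonical introductory example. This is a generic sketch, not code from the talk or from Riak; the class and method names are my own.

```python
class GCounter:
    """Grow-only counter CRDT. Each replica increments only its own slot;
    merge takes the per-replica maximum, so merging is commutative,
    associative, and idempotent, and all replicas converge."""

    def __init__(self):
        self.counts = {}

    def increment(self, replica_id, amount=1):
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        merged = GCounter()
        for r in set(self.counts) | set(other.counts):
            merged.counts[r] = max(self.counts.get(r, 0),
                                   other.counts.get(r, 0))
        return merged

# Two replicas accept writes independently, then exchange state.
a, b = GCounter(), GCounter()
a.increment("replica_a", 3)
b.increment("replica_b", 2)
print(a.merge(b).value(), b.merge(a).value())   # 5 5 (order doesn't matter)
```

The point Alvaro draws out is that structures like this converge without coordination, which is what makes them useful under asynchrony.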
But what happens as objects are in flight? There must also be flow-level consistency for data in motion. Language-level consistency can help with this problem. Alvaro makes the following key points:
Consistency is tolerance to asynchrony
Tip: Focus on data in motion, not at rest
Alvaro then moves from distributed consistency to fault tolerance. He discusses his most recent research “lineage-driven fault injection.” He reminds us that we build systems of components and we verify these components to be fault tolerant.
However, when we put these components together it doesn’t guarantee end-to-end fault tolerance.
Alvaro talks about the challenges of the top-down approach to testing all components in a system and outlines the goal of lineage-driven fault injection (LDFI).
Alvaro then introduces us to Molly, a top-down fault injector. He describes Molly as starting from the middle of a maze and working its way to the outside as a method of arriving at a solution.
Alvaro provides detailed examples to show modeling programs using lineage so that fault tolerance can be analyzed. He then shows how the role of the adversary can be automated. He describes Molly in more detail as a prototype LDFI implementation. Molly finds fault-tolerance violations quickly or guarantees that none exist. Alvaro provides some output using Molly and shows how lineage allows you to reason backwards from good outcomes.
Alvaro closes with a recap and explanation describing composition as the hardest problem of distributed systems.
Don’t miss this interesting and informative presentation.
Also, KDnuggets did a follow-up interview with Alvaro, published as a two-part article, in which he expanded on some points made in his RICON 2014 Keynote speech.
A Muddle That Slows Adoption
Basho released Riak as an open source project seven months ago and began commercial service shortly thereafter. As new entrants into the loose collection of database projects, we observed two things:
- Widespread Confusion — the NoSQL rubric, and the decisions of most projects to self-identify under it, have created a false perception of overlap and similarity between projects differing not just technically but in approaches to licensing and distribution, leading to…
- Needless Competition — driving the confusion, many projects (us included, for sure) competed passionately (even acrimoniously) for attention as putative leaders of NoSQL, a fool’s errand as it turns out. One might as well claim leadership of all tools called wrenches.
The optimal use cases, architectures, and methods of software development differ so starkly even among superficially similar projects that to compete is to demonstrate a (likely pathological) lack of both understanding of user needs and self-knowledge.
This confusion and wasted energy — in other words, the market inefficiency — has been the fault of anyone who has laid claim to, or professed designs on, the NoSQL crown.
- Adoption suffers — Users either make decisions based on muddled information or, worse, do not make any decision whatsoever.
- Energy is wasted — At Basho we spent too much time from September to December answering the question posed without fail by investors and prospective users and clients: “Why will you ‘win’ NoSQL?”
With the vigor of fools, we answered this question, even though we rarely if ever encountered another project’s software in a “head-to-head” competition. (In fact, in the few cases where we have been pitted head-to-head against another project, we have won or lost so quickly that we cannot help but conclude the evaluation could have been avoided altogether.)
The investors and users merely behaved as rational (though often bemused) market actors. Having accepted the general claim that NoSQL was a monolithic category, both sought to make a bet.
Clearly what is needed is objective information presented in an environment of cooperation driven by mutual self-interest.
This information, shaped not by any one person’s necessarily imperfect understanding of the growing collection of data storage projects but rather by all the participants themselves, would go a long way to remedying the inefficiencies discussed above.
Demystification through data, not marketing claims
We have spoken to representatives of many of the emerging database projects. They have enthusiastically agreed to participate in a project to disclose data about each project. Disclosure will start with the following: a common benchmark framework and benchmarks/load tests modeling common use cases.
- A Common Benchmark Framework — For this collaboration to succeed, no single aspect will impact success or failure more than arriving at a common benchmark framework.
At Basho we have observed the proliferation of “microbenchmarks,” or benchmarks that do not reflect the conditions of a production environment. Benchmarks that use a small data set, that do not store to disk, or that run for short durations (less than 12 hours) do more to confuse the issue for end users than any other single factor. Participants will agree on benchmark methods, tools, and applicability to use cases, and will agree to make all benchmarks reproducible.
Compounding the confusion, benchmarks are sometimes run against different use cases or on different hardware, yet compared head-to-head as if the tests or systems were identical. We will seek to help participants run equivalent tests on the various databases, and we will not publish benchmark results that do not profile the hardware and configuration of the systems under test.
- Benchmarks That Support Use Cases — participants agree to benchmark their software under the conditions and with load tests reflective of use cases they commonly see in their user base or for which they think their software is best suited.
- Dissemination to third-parties — providing easy-to-find data to any party interested in posting results.
- Honestly exposing disagreement — Where agreement cannot be reached on any of the elements of the common benchmarking efforts, participants will respectfully expose the rationales for their divergent positions, thus still providing users with information upon which to base decisions.
There is more work to be done but all participants should begin to see the benefits: faster, better decisions by users.
We invite others to join, once we are underway. We, and our counterparts at other projects, believe this approach will go a long way to removing the inefficiencies hindering adoption of our software.