March 17, 2015
As a Solutions Architect for Basho, I’m often called upon by customers to explain Riak. Frequently this is in the context of a specific use-case they are deploying, or a problem they are attempting to solve. In each of these cases, an important component of the conversation is ensuring a base level of understanding of the design principles in Riak’s distributed architecture. It is this architecture, and these early design decisions that ensure that Riak “just works” and why companies choose Riak for their business critical applications.
This blog highlights key reasons why Riak “just works” from a high availability, fault tolerant, data distribution, predictable performance and operational simplicity perspective. It also shows how Riak addresses data replication, detection of node failures, read latency, and rolling restarts due to upgrades, failures or operating system related issues.
Uniform Data Distribution and Predictable Performance
If you have ever seen a presentation about Riak, you have seen an inclusion of the “ring diagram.” The image is fairly simple and one that we use to describe data distribution, scalability, and performance characteristics of Riak. But the underlying architectural implementation that this ring represents is an important characteristic of Riak’s architecture.
Riak employs a SHA1 based consistent hashing algorithm that is mathematically proven to produce a perfectly uniform distribution about a 160-bit space (the ring). This 160-bit space is divided into equal partitions called “vnodes.” These vnodes, in turn, are evenly distributed amongst participating physical nodes in the cluster. Uniform distribution about the 160-bit space and an even allocation of vnodes amongst physical nodes ensures keys are uniformly distributed about the cluster. Participating nodes in a Riak cluster are homogeneous – meaning any node can service any request – and due to the nature of consistent hashing, every node in the cluster knows where data should reside within the cluster.
Riak’s default behavior is to seek equilibrium. Each node in a cluster is responsible for 1/nth data within the cluster and 1/nth total performance, where n represents total node count in a cluster. This architectural principle allows operators to make reasonable assumptions about Riak’s linear scalability. Often in sharded systems, access patterns that disproportionately access specific ranges of data will cause “hot spots” in a cluster making predictable operations more difficult to maintain. Uniform data distribution about the cluster (and Hinted Handoff described below) allows for continuous normal operations of an active Riak cluster while maintaining a predictable level of performance as any node is removed from the cluster for any reason. Even in resource-starved virtual environments, a Riak cluster will work to maintain its equilibrium and equal data distribution such that operators may assemble larger numbers of slower individual virtual machines to achieve their desired performance profile in aggregate. Additionally, each physical node in a Riak cluster maintains its own performance statistics that are easily accessible and parsable such that an advanced deployment would be able to wrap those statistics into a sick-node detection algorithm based on their own specific thresholds.
Furthering the predictable performance rationale, Riak uses vector clocks, specifically dotted version vectors in Riak 2.0, to internally reason about conflict resolution as it relates to multiple updates to the same object. Updates to any object are independent of updates to any other object and thus there is no locking in Riak for any of its core operations – reads, writes and deletes. There are no global or local locks on any tables or rows – those structures do not exist in Riak. Locks introduce nondeterministic delays in performance yet are often a necessary component of any absolutely consistent database. Riak on the other hand is an eventually consistent database (caveat Riak 2.0’s strong consistency feature at the expense of availability). Additionally, because the base level of abstraction in Riak is the vnode, and data is replicated amongst a set of vnodes operating on distinct physical hardware, background processes such as file compaction is scheduled in such a way that it only affects a segment of the cluster at any given time. This rolling compaction ensures a high degree of availability with minimal performance penalty.
High Availability and Failure Recovery
The measure of a distributed system is not how well that system runs under optimal conditions in the general case, but rather how that system performs in the face of failure. An architect must ask herself how well her system will perform in the face of node failure and how well that system will recover from failure. Failures happen with increasing frequency as the size of your system grows. Riak implements a number of technologies that when combined ensure Riak excels at failure recovery. These technologies provide a baseline set of features that allow Riak to quickly recover from failure scenarios with minimal operator intervention. Not only will these built-in recovery mechanisms maintain eventual consistency within the database but they will also maintain synchronicity with features such as Solr’s full text indexing and Multi Datacenter Replication.
Hinted Handoff ensures data is replicated an appropriate number of times in spite of failure by allowing a node to take over responsibility for a vnode and then return that data to its original “owner.” Handoff from one vnode to another can happen on a temporary (in the case of failure) or permanent (in the case of cluster resizing) basis. In both cases, Riak handles this automatically while the cluster remains available. Because Riak is able to dynamically allocate vnode assignments continuously, the cluster can absorb the loss of any physical node(s) for write operations and ensure availability of data as long as one vnode of any replica set is still accessible. This allows Riak to maintain availability when a node is removed from the cluster for any reason – whether scheduled or not. Ultimately, an unscheduled failure or a scheduled upgrade of Riak or the operating system results in the removal of a node from the cluster – Riak’s core architecture and capabilities accommodate this behavior with minimal operator intervention.
Read Repair is a mechanism triggered on a successful read of a value where all replicas may not agree on the value. There are two possibilities where this can happen: when one replica returns not found – meaning it doesn’t have a copy, and when one replica returns a value where the vector clocks is an ancestor of the successful read. When this situation occurs, Riak will force the errant nodes to update the object’s value based on the value of the successful read.
Finally, Active Anti Entropy corrects inconsistent data continuously in the background. Where Read Repair corrects data at read time, AAE corrects all data regardless of whether or not it is actively accessed by running a background process that continuously checks for inconsistencies. AAE uses Merkle Tree hash exchanges between vnodes to look for these inconsistencies. When a difference at the top of the tree is detected, Riak recursively checks the tree until it finds the exact values with a difference between vnodes and then sends the smallest amount of data necessary to regain equilibrium.
Architecture Makes a Difference
Riak’s core architecture and key technical features provide the building blocks valued in a highly distributed environment. Uniform data distribution, homogeneous nodes, hinted handoff, read repair and active anti entropy all play a role in providing the high availability, fault tolerance, predictable performance and operational simplicity developers and managers are looking for from their non-relational persisted data solutions.
We’ve seen Riak adopted across a wide variety of verticals and for a broad range of use cases. From gaming to retail to advertising to mobile, our customers begin by identifying a workload, or use case, where availability, scalability, and fault tolerance are critical. We then work closely with these customers to ensure Riak is an ideal fit for the architectural design and business requirements. This process often begins with a Tech Talk. Someone like me, working with you either onsite or remotely, to assess how Riak can help you solve your critical business requirements. You can sign up for a Tech Talk here.
March 26, 2014
Riak’s overarching design goal is simple: be maximally available. If your data center is on fire, Riak will be the last part of your stack to fail (and hopefully, you’ve purchased an enterprise license, so there’s another cluster in another data center ready to go at all times).
In order to make sure your data can survive server failures, Riak retains multiple copies (replicas) and allows lock-free, uncoordinated updates.
This then open ups the possibility that data will be out of sync across a cluster. Riak manages this issue in three distinct stages: entropy detection, correction, and conflict resolution.
Among the oldest and simplest tools in Riak is Read Repair, which, as its name implies, is triggered when a read request is received. If the server coordinating the operation notices that the servers responsible for the key do not agree on its value, correction is required.
A more recent feature in Riak is Active Anti-Entropy (often shortened to AAE). Effectively, this is the proactive version of read repair and runs in the background. Riak maintains hash trees to monitor for inconsistent data between servers; when divergent values are detected, correction is mandated.
As discussed in the blog post, Clocks Are Bad, Or, Welcome to the Wonderful World of Distributed Systems, automatically determining the “correct” value in the event of a conflict is not simple, and often not possible at the database layer.
Using contextual metadata called vector clocks, Riak will attempt to determine whether one of the discovered values is derived from the other. In that case, it can safely choose the most recent value. This value is written to all servers that have a copy of the data and conflict resolution is not required.
If Riak can’t verify such a causal relationship, things get more difficult.
Riak’s default behavior, is to fall back to server clocks to determine a winner. This can lead to unexpected results if concurrent updates are received but, on the positive side, conflict resolution will not be required.
If Riak is instead configured with
allow_mult=true to protect data integrity, even when independent writes are received, Riak will write both values to the servers as siblings for later conflict resolution.
Conflict resolution means that Riak detects a conflict, can’t resolve it intelligently, and isn’t instructed to resolve it otherwise.
Next time the application attempts to read such a value, instead of receiving one answer, it’s going to receive (at least) two. It is now the application’s responsibility to deal with the conflict and provide a new value back to Riak for future reads.
This can be trivial (pick one value), obvious (merge all values), or tricky (apply business logic and come back with something different).
With Riak 2.0, Basho is introducing Riak Data Types, which are designed to handle conflict resolution automatically. If your data can be formulated as a set or a map (not terribly dissimilar from JSON), Riak can process and resolve the siblings for you when a read request is received.
Many databases, particularly distributed ones, are effectively non-deterministic in the presence of concurrent writes. If they try to enforce consistency on writes, they sacrifice availability and often data integrity. If they don’t enforce consistency, they may rely on server (or worse, client) clocks to pick a winner, if they even have a strategy at all.
Riak is unique in encouraging developers to think about conflict resolution. Why? Because, for large distributed systems, network and server failures are a fact of life. For very large distributed systems, data duplication and inconsistency is unavoidable. Since Riak is designed for scale, it’s better to provide a structure for proper resolution than to pretend conflicts don’t exist.
January 27, 2014
On the official Basho docs, we compare Riak to multiple other databases. We are currently working on updating these comparisons, but, in the meantime, we wanted to provide a more up-to-date comparison for one of the more common questions we’re asked: How does Riak compare to Cassandra?
Cassandra looks the most like Riak out of any other widely-deployed data storage technology in existence. Cassandra and Riak have architectural roots in Amazon’s Dynamo, the system Amazon engineered to handle their highly available shopping cart service. Both Riak and Cassandra are masterless, highly available stores that persist replicas and handle failure scenarios through concepts such as hinted handoff and read-repair. However, there are certain key differences between the two that should be considered when evaluating them.
Amazon’s Dynamo utilized a Key/Value data model. Early in Cassandra’s development, a decision was made to diverge from keys and values toward a wide-row data model (similar to Google’s BigTable). This means Cassandra is a Key/Key/Value store, which includes the concept of Column Families that contain columns. With this model, Cassandra is able to handle high write volumes, typically by appending new data to a row. This also allows Cassandra to perform very efficient range queries, with the tradeoff being a more rigid data model since rows are “fixed” and non-sequential read operations often require several disk seeks.
On the other hand, Riak is a straight Key/Value store. We believe this offers the most flexibility for data storage. Riak’s schemaless design has zero restrictions on data type, so an object can be a JSON document at one moment and a JPEG at the next.
Like Cassandra, Riak also excels at high write volumes. Range queries can be a little more costly, though still achievable through Secondary Indexes. In addition, there are a number of data modeling tips and tricks for Riak that make it easy to expose access to data in ways that sometimes aren’t as obvious at first glance. Below are a few examples:
In Riak, multi-datacenter replication is achieved by connecting independent clusters, each of which own their own hash ring. Operators have the ability to manage each cluster and select all or part of the data to replicate across a WAN. Multi-datacenter replication in Riak features two primary modes of operation: full sync and real-time. Data transmitted between clusters can be encrypted via OpenSSL out-of-the-box. Riak also allows for per-bucket replication for more granular control.
Cassandra achieves replication across WANs by splitting the hash ring across two or more clusters, which requires operators to manually define a NetworkTopologyStrategy, Replication Factor, a Replication Placement Strategy, and a Consistency Level for both local and cross data center requests.
Conflict Resolution and Object Versioning
Cassandra uses wall clock timestamps to establish ordering. The resolution strategy in this case is Last Write Wins (LWW), which means that data may be overwritten when there is write contention. The odds of data loss are magnified by (inevitable) server clock drift. More details on this can be found in the blog, “Clocks are Hard, or, Welcome to the Wonderful World of Distributed Systems.”
Riak uses a data structure called vector clocks to track the causal ordering of updates. This per-object ancestry allows Riak to identify and isolate conflicts without using system clocks.
In the event of a concurrent update to a single key, or a network partition that leaves application servers writing to Riak on both sides of the split, Riak can be configured to keep all writes and expose them to the next reader of that key. In this case, choosing the right value happens at the application level, allowing developers to either apply business logic or some common function (e.g. merge union of values) to resolve the conflict. From there, that value can be written back to Riak for its key. This ensures that Riak never loses writes.
Riak Data Types, first introduced in Riak 1.4 and expanded in the upcoming Riak 2.0, are designed to converge automatically. This means Riak will transparently manage the conflict resolution logic for concurrent writes to objects.
In the event of server failures and network problems, Riak is designed to always accept read and write requests, even if the servers that are ordinarily responsible for that data are unavailable.
Cassandra will allow writes to (optionally) be stored on alternative servers, but will not allow that data to be retrieved. Only after the cluster is repaired and those writes are handed off to an appropriate replica server (with the potential data loss that timestamp-based conflict resolution implies, as discussed earlier) will the data that was written be available to readers.
Imagine a user working with a shopping cart when the application is unable to connect to the primary replicas. The user can re-add missing items to the cart but will never actually see the items show up in the cart (unless the application performs its own caching, which introduces more layers of complexity and points of failure).
When handling missing or divergent/stale data, Riak and Cassandra have many similarities. Both employ a passive mechanism where read operations trigger the repair of inconsistent replicas (known as read-repair). Both also use Active Anti-Entropy, which builds a modified Merkle tree to track changes or new inserts on a per hash-ring-partition basis. Since the hash rings contain overlapping keys, the trees are compared and any divergent or missing data is automatically repaired in the background. This can be incredibly effective at combating problems such as bitrot, since Active Anti-Entropy does not need to wait for a read operation.
The key difference in implementation is that Cassandra uses short-lived, in-memory hash trees that are built per Column Family and generated as snapshots during major data compactions. Riak’s trees are on-disk and persistent. Persistent trees are safer and more conducive to ensuring data integrity across much larger datasets (e.g. 1 billion keys could easily cost 8-16GB of RAM in Cassandra versus 8-16GB of disk in Riak).
Both Cassandra and Riak are eventually consistent, scalable databases that have strengths for specific use cases. Each has hundreds of thousands of hours of engineering invested and the commercial backing and support offered by their respective companies, Datastax and Basho. At Basho, we have labored to make Riak very robust and easy to operate at both large and small scale. For more information on how Riak is being used, visit our Riak Users page. For a look at what’s to come, download the Technical Preview of Riak 2.0.
November 13, 2013
This series of blog posts will discuss how Riak differs from traditional relational databases. For more information about any of the points discussed, download our technical overview, “From Relational to Riak.”
One of the biggest differences between Riak and relational systems is our focus on availability. Riak is designed to be deployed to, and runs best on, multiple servers. It can continue to function normally in the presence of hardware and network failures. Relational databases, conversely, are simplest to set up on a single server.
Most relational databases offer a master/slave architecture for availability, in which only the master server is available for data updates. If the master fails, the slave is (hopefully) able to step in and take over.
However, even with this simple model, coping with failure (or even properly defining it) is non-trivial. What happens if the master and slave server cannot talk to each other? How do you recover from a split brain scenario, where both servers think they’re the master and accept updates? What happens if the slave is slow to respond to updates sent from the master database? Can clients read from a slave? If so, does the master need to verify that the slave has received all updates before it commits them locally and responds to the client that requested the updates?
Conversely, Riak is explicitly designed to expect server and network failure. Riak is a masterless system, meaning any server can respond to read or write requests. If one fails, others will continue to service client requests. Once this server becomes available again, the cluster will feed it any updates that it missed through a process we call hinted handoff.
Because Riak’s system allows for reads and writes when multiple servers are offline or otherwise unreachable, data may not always be consistent across the environment (usually only for a few milliseconds). However, through self-healing mechanisms like read repair and Active Anti-Entropy, all updates will propagate to all servers making data eventually consistent.
For many use cases, high availability is more important than strict consistency. Data unavailability can negatively impact revenue, damage user trust, lead to poor user experience, and cause lost critical data. Industries like gaming, mobile, retail, and advertising require always-on availability. Visit our Users Page to see how companies in various industries use Riak.
February 21, 2013
Today we are excited to announce the latest version of Riak. Here is a summary of the major enhancements delivered in Riak 1.3:
- Introduced Active Anti-Entropy. Riak now has active anti-entropy. In distributed systems, inconsistencies can arise between replicas due to failure modes, concurrent updates, and physical data loss or corruption. Pre-1.3 Riak already had several features for repairing this “entropy”, but they all required some form of user intervention. Riak 1.3 introduces automatic, self-healing properties that repair entropy on an ongoing basis.
- Improved Riak Enterprise’s multi-datacenter replication performance. New advanced mode for multi-datacenter replication capabilities, with better performance, more TCP connections and easier configuration. Read more in this write up from GigaOM.
- Improved graphical user experience. Riak Control, the user interface for managing and monitoring Riak, has a brand new look.
- Expanded IPv6 support. IPv6 support in Riak now is supported by all interfaces.
- Improved MapReduce. Riak MapReduce has improved back-pressure to reduce the risk of overwhelming endpoint processes during large tasks.
- Simplified log management. Riak can now optionally send log messages to syslog.
Ready to get started or upgrade? Download the new release here, check out the official release notes, or read on for more details. Documentation for all products and releases is available on the documentation site. For an introduction to Riak and what’s new in Riak 1.3, sign up for our webcast on Thursday, March 7.
More on What’s in Riak 1.3
A key feature of Riak is its ability to regenerate lost or corrupted data from replicated data stored on other nodes. Prior to this release, Riak provided two methods to repair data:
- Read Repair: Riak compares the replies from all replicas during a read request, repairing any replica that is divergent or missing data. (K/V data only)
- Repair Command via Riak Console: Introduced in Riak 1.2, the repair command enables users to trigger a repair of a specific partition. The partition is rebuilt based on a subset of data stored on adjacent nodes in the Riak ring. All data is rebuilt, not just missing or divergent data. (K/V and Search data)
Riak 1.3 introduces active anti-entropy, a continuous background process that compares and repairs any divergent, missing, or corrupted replicas (K/V data only). Unlike read repair, which is only triggered when data is read, the active anti-entropy system ensures the integrity of all data stored in Riak. This is particularly useful in clusters containing “cold data”: data that may not be read for long periods of time, potentially years. Furthermore, unlike the repair command, active anti-entropy is an automatic process, requiring no user intervention and is enabled by default in Riak 1.3.
Riak’s active anti-entropy feature is based on hash tree exchange, which enables differences between replicas to be determined with minimal exchange of information. Specifically, the amount of information exchanged in the process is proportional to the differences between two replicas, not the amount of data that they contain. Approximately the same amount of information is exchanged when there are 10 differing keys out of 1 million keys as when there are 10 differing keys out of 10 billion keys. This enables Riak to provide continuous data protection regardless of cluster size.
Additionally, Riak uses persistent, on-disk hash trees rather than purely in-memory trees, a key difference from similar implementations in other products. This allows Riak to maintain anti-entropy information for billions of keys with minimal additional memory usage, as well as allows Riak nodes to be restarted without losing any anti-entropy information. Furthermore, Riak maintains the hash trees in real time, updating the tree as new write requests come in. This reduces the time it takes Riak to detect and repair missing/divergent replicas. For added protection, Riak periodically (default: once a week) clears and regenerates all hash trees from the on-disk K/V data. This enables Riak to detect silent data corruption to the on-disk data arising from bad disks, faulty hardware components, etc.
New Look for Riak Control
Riak Control is a UI for managing and monitoring your Riak cluster. Riak Control lets you start and re-start Riak nodes, view a “health check” for your cluster, see all nodes and their current status, and have visibility into their partitions and services. Riak Control now has a brand new look and feel. Check out the Riak Control Github page to get up and running.
Expanded IPv6 Support
While Riak’s HTTP interface has always supported IPv6, not all of its interfaces have been as current. In Riak 1.3, the protocol buffers interfaces can now listen on IPv6 or IPv4 addresses. Riak handoff (which is responsible for data transfer when nodes are added or removed, and for handing off update responsibilities when nodes fail) also supports IPv6. It should also be noted that community member Tom Lanyon started the work on this feature. Thanks, Tom!
Improved Backpressure in Riak MapReduce
Riak Enterprise: Advanced Multi-Datacenter Replication Capabilities
With hundreds of companies using Riak Enterprise, a commercial extension of Riak, we’ve been lucky to work with many teams pushing the limits of multi-datacenter replication performance and resiliency. We’ve learned a lot and are excited to announce these capabilities are now available in advanced mode.
- Previously, multi-datacenter replication had one TCP connection over which data was streamed from one cluster to another. This could create a performance bottleneck, especially when run on nodes constrained by per-instance bandwidth limits, such as in a cloud environment. In the new version of multi-datacenter replication, multiple concurrent TCP connections (approximately one per physical node) and processes are used between sites.
- Configuration of multi-datacenter replication is easier. Use a shell command to name your clusters, then connect both clusters using a simple ip:port combination.
- Better per-connection statistics for both full-sync and real-time modes.
- New ability to tweak full-sync workers per node and per cluster, allowing customers to dial-in performance.
The new replication improvements are already used in production by customers and yielding significant performance improvements. For now, the new replication technology is available in advanced mode: it’s optional to turn on. It currently doesn’t have all of the features of the default mode – including SSL, NAT support and full-sync scheduling. Both default and advanced modes are available in the 1.3 release and function independently. In the future, “advanced mode” will become the default.
For more details about multi-datacenter replication, download our whitepaper, “Multi-Datacenter Replication: A Technical Overview.”