Tag Archives: read repair

Entropy in Riak

March 26, 2014

Riak’s overarching design goal is simple: be maximally available. If your data center is on fire, Riak will be the last part of your stack to fail (and hopefully, you’ve purchased an enterprise license, so there’s another cluster in another data center ready to go at all times).

In order to make sure your data can survive server failures, Riak retains multiple copies (replicas) and allows lock-free, uncoordinated updates.

This then open ups the possibility that data will be out of sync across a cluster. Riak manages this issue in three distinct stages: entropy detection, correction, and conflict resolution.

Entropy Detection

Among the oldest and simplest tools in Riak is Read Repair, which, as its name implies, is triggered when a read request is received. If the server coordinating the operation notices that the servers responsible for the key do not agree on its value, correction is required.

A more recent feature in Riak is Active Anti-Entropy (often shortened to AAE). Effectively, this is the proactive version of read repair and runs in the background. Riak maintains hash trees to monitor for inconsistent data between servers; when divergent values are detected, correction is mandated.

Correction

As discussed in the blog post, Clocks Are Bad, Or, Welcome to the Wonderful World of Distributed Systems, automatically determining the “correct” value in the event of a conflict is not simple, and often not possible at the database layer.

Using contextual metadata called vector clocks, Riak will attempt to determine whether one of the discovered values is derived from the other. In that case, it can safely choose the most recent value. This value is written to all servers that have a copy of the data and conflict resolution is not required.

If Riak can’t verify such a causal relationship, things get more difficult.

Riak’s default behavior, is to fall back to server clocks to determine a winner. This can lead to unexpected results if concurrent updates are received but, on the positive side, conflict resolution will not be required.

If Riak is instead configured with allow_mult=true to protect data integrity, even when independent writes are received, Riak will write both values to the servers as siblings for later conflict resolution.

Conflict Resolution

Conflict resolution means that Riak detects a conflict, can’t resolve it intelligently, and isn’t instructed to resolve it otherwise.

Next time the application attempts to read such a value, instead of receiving one answer, it’s going to receive (at least) two. It is now the application’s responsibility to deal with the conflict and provide a new value back to Riak for future reads.

This can be trivial (pick one value), obvious (merge all values), or tricky (apply business logic and come back with something different).

With Riak 2.0, Basho is introducing Riak Data Types, which are designed to handle conflict resolution automatically. If your data can be formulated as a set or a map (not terribly dissimilar from JSON), Riak can process and resolve the siblings for you when a read request is received.

Why?

Many databases, particularly distributed ones, are effectively non-deterministic in the presence of concurrent writes. If they try to enforce consistency on writes, they sacrifice availability and often data integrity. If they don’t enforce consistency, they may rely on server (or worse, client) clocks to pick a winner, if they even have a strategy at all.

Riak is unique in encouraging developers to think about conflict resolution. Why? Because, for large distributed systems, network and server failures are a fact of life. For very large distributed systems, data duplication and inconsistency is unavoidable. Since Riak is designed for scale, it’s better to provide a structure for proper resolution than to pretend conflicts don’t exist.

John Daily

Riak vs. Cassandra – A Brief Comparison

January 27, 2014

On the official Basho docs, we compare Riak to multiple other databases. We are currently working on updating these comparisons, but, in the meantime, we wanted to provide a more up-to-date comparison for one of the more common questions we’re asked: How does Riak compare to Cassandra?

Cassandra looks the most like Riak out of any other widely-deployed data storage technology in existence. Cassandra and Riak have architectural roots in Amazon’s Dynamo, the system Amazon engineered to handle their highly available shopping cart service. Both Riak and Cassandra are masterless, highly available stores that persist replicas and handle failure scenarios through concepts such as hinted handoff and read-repair. However, there are certain key differences between the two that should be considered when evaluating them.

Data Model

Amazon’s Dynamo utilized a Key/Value data model. Early in Cassandra’s development, a decision was made to diverge from keys and values toward a wide-row data model (similar to Google’s BigTable). This means Cassandra is a Key/Key/Value store, which includes the concept of Column Families that contain columns. With this model, Cassandra is able to handle high write volumes, typically by appending new data to a row. This also allows Cassandra to perform very efficient range queries, with the tradeoff being a more rigid data model since rows are “fixed” and non-sequential read operations often require several disk seeks.

On the other hand, Riak is a straight Key/Value store. We believe this offers the most flexibility for data storage. Riak’s schemaless design has zero restrictions on data type, so an object can be a JSON document at one moment and a JPEG at the next.

Like Cassandra, Riak also excels at high write volumes. Range queries can be a little more costly, though still achievable through Secondary Indexes. In addition, there are a number of data modeling tips and tricks for Riak that make it easy to expose access to data in ways that sometimes aren’t as obvious at first glance. Below are a few examples:

Multi-Datacenter Replication

In Riak, multi-datacenter replication is achieved by connecting independent clusters, each of which own their own hash ring. Operators have the ability to manage each cluster and select all or part of the data to replicate across a WAN. Multi-datacenter replication in Riak features two primary modes of operation: full sync and real-time. Data transmitted between clusters can be encrypted via OpenSSL out-of-the-box. Riak also allows for per-bucket replication for more granular control.

Cassandra achieves replication across WANs by splitting the hash ring across two or more clusters, which requires operators to manually define a NetworkTopologyStrategy, Replication Factor, a Replication Placement Strategy, and a Consistency Level for both local and cross data center requests.

Conflict Resolution and Object Versioning

Cassandra uses wall clock timestamps to establish ordering. The resolution strategy in this case is Last Write Wins (LWW), which means that data may be overwritten when there is write contention. The odds of data loss are magnified by (inevitable) server clock drift. More details on this can be found in the blog, “Clocks are Hard, or, Welcome to the Wonderful World of Distributed Systems.”

Riak uses a data structure called vector clocks to track the causal ordering of updates. This per-object ancestry allows Riak to identify and isolate conflicts without using system clocks.

In the event of a concurrent update to a single key, or a network partition that leaves application servers writing to Riak on both sides of the split, Riak can be configured to keep all writes and expose them to the next reader of that key. In this case, choosing the right value happens at the application level, allowing developers to either apply business logic or some common function (e.g. merge union of values) to resolve the conflict. From there, that value can be written back to Riak for its key. This ensures that Riak never loses writes.

For more information, visit Basho blog posts on Why Vector Clocks are Easy and Why Vector Clocks are Hard.

Riak Data Types, first introduced in Riak 1.4 and expanded in the upcoming Riak 2.0, are designed to converge automatically. This means Riak will transparently manage the conflict resolution logic for concurrent writes to objects.

Availability

In the event of server failures and network problems, Riak is designed to always accept read and write requests, even if the servers that are ordinarily responsible for that data are unavailable.

Cassandra will allow writes to (optionally) be stored on alternative servers, but will not allow that data to be retrieved. Only after the cluster is repaired and those writes are handed off to an appropriate replica server (with the potential data loss that timestamp-based conflict resolution implies, as discussed earlier) will the data that was written be available to readers.

Imagine a user working with a shopping cart when the application is unable to connect to the primary replicas. The user can re-add missing items to the cart but will never actually see the items show up in the cart (unless the application performs its own caching, which introduces more layers of complexity and points of failure).

Data Integrity

When handling missing or divergent/stale data, Riak and Cassandra have many similarities. Both employ a passive mechanism where read operations trigger the repair of inconsistent replicas (known as read-repair). Both also use Active Anti-Entropy, which builds a modified Merkle tree to track changes or new inserts on a per hash-ring-partition basis. Since the hash rings contain overlapping keys, the trees are compared and any divergent or missing data is automatically repaired in the background. This can be incredibly effective at combating problems such as bitrot, since Active Anti-Entropy does not need to wait for a read operation.

The key difference in implementation is that Cassandra uses short-lived, in-memory hash trees that are built per Column Family and generated as snapshots during major data compactions. Riak’s trees are on-disk and persistent. Persistent trees are safer and more conducive to ensuring data integrity across much larger datasets (e.g. 1 billion keys could easily cost 8-16GB of RAM in Cassandra versus 8-16GB of disk in Riak).

Summary

Both Cassandra and Riak are eventually consistent, scalable databases that have strengths for specific use cases. Each has hundreds of thousands of hours of engineering invested and the commercial backing and support offered by their respective companies, Datastax and Basho. At Basho, we have labored to make Riak very robust and easy to operate at both large and small scale. For more information on how Riak is being used, visit our Riak Users page. For a look at what’s to come, download the Technical Preview of Riak 2.0.

Tom Santero

Relational to Riak – High Availability

November 13, 2013

This series of blog posts will discuss how Riak differs from traditional relational databases. For more information about any of the points discussed, download our technical overview, “From Relational to Riak.”


One of the biggest differences between Riak and relational systems is our focus on availability. Riak is designed to be deployed to, and runs best on, multiple servers. It can continue to function normally in the presence of hardware and network failures. Relational databases, conversely, are simplest to set up on a single server.

Most relational databases offer a master/slave architecture for availability, in which only the master server is available for data updates. If the master fails, the slave is (hopefully) able to step in and take over.

However, even with this simple model, coping with failure (or even properly defining it) is non-trivial. What happens if the master and slave server cannot talk to each other? How do you recover from a split brain scenario, where both servers think they’re the master and accept updates? What happens if the slave is slow to respond to updates sent from the master database? Can clients read from a slave? If so, does the master need to verify that the slave has received all updates before it commits them locally and responds to the client that requested the updates?

Conversely, Riak is explicitly designed to expect server and network failure. Riak is a masterless system, meaning any server can respond to read or write requests. If one fails, others will continue to service client requests. Once this server becomes available again, the cluster will feed it any updates that it missed through a process we call hinted handoff.

Because Riak’s system allows for reads and writes when multiple servers are offline or otherwise unreachable, data may not always be consistent across the environment (usually only for a few milliseconds). However, through self-healing mechanisms like read repair and Active Anti-Entropy, all updates will propagate to all servers making data eventually consistent.

For many use cases, high availability is more important than strict consistency. Data unavailability can negatively impact revenue, damage user trust, lead to poor user experience, and cause lost critical data. Industries like gaming, mobile, retail, and advertising require always-on availability. Visit our Users Page to see how companies in various industries use Riak.

Basho