May 10, 2013
This is the last part of a 4-part series on configuring Riak’s oft-subtle behavioral characteristics.
For my final post, I’ll tackle two very different objectives and show how to choose configuration values to support them.
Fast and sloppy
Let’s say you’re reading and writing statistics that don’t need to be perfectly accurate. For example, if a website has been “liked” 300 times, it’s not the end of the world if that site reports 298 or 299 likes, but it absolutely is a problem if the page loads slowly because it takes too long to read that value.
We want to maximize performance by responding to the client as soon as we have read a value or confirmed a write.
Before Riak 1.3.1, this would have been implied by
W=1, but that is no longer true. In any event, better to be explicit: we only need one vnode to send the data to its backend before the client receives a response.
We don’t need to spend time pulling and updating vector clocks; just write the latest value as quickly as possible.
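As a rough sketch, the "fast and sloppy" settings described above might be expressed as bucket properties like the following (the property names follow Riak's bucket-properties vocabulary; the exact values you choose should be validated against your own version and workload, so treat this as an illustration rather than a recipe):

```json
{
  "props": {
    "r": 1,
    "w": 1,
    "dw": 1,
    "last_write_wins": true
  }
}
```

Here `r` and `w` of 1 respond to the client after a single vnode reply, `dw` of 1 requires only one backend write, and `last_write_wins` skips the vector clock bookkeeping as described above.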
Not so obvious
One problem with
R=1 is that
notfound_ok=true by default. This means that if the first vnode to respond doesn’t have a copy, the failure to find a value will be treated as authoritative and thus the client will receive a notfound response.
If a webpage has been liked 5000 times, but one of the primary nodes has fallen over and a failover node is in place without a complete data set, you don’t want to report 0 likes.
notfound_ok=false instructs the coordinating node to wait for something other than a
notfound error before reporting a value.
However, waiting for all 3 vnodes to report that there is no such data point may be overkill (and too much of a hit on performance), so you can use
basic_quorum=true to short circuit the process and report back a
notfound status after only 2 replies.
Playing with fire
This is not for the faint of heart, but if you can accept the fact that you will occasionally lose access to data if two nodes are unavailable, this will reduce the cluster traffic and thus potentially improve overall system performance. This is definitely not a recommended configuration.
Strict, slow, and failure-prone
At the other end of the spectrum, if you want to favor consistency at the expense of availability, you can certainly do so.
Configuring PR and PW such that PR + PW > N gives you Read Your Own Writes (RYOW) consistency, as we discussed in an earlier post. Be prepared for more frequent failures, however, since the cluster will not be at liberty to distribute reads and writes as widely as possible.
If a conflict somehow does occur, give all of the values to the application to resolve.
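A consistency-favoring bucket configuration along these lines might look roughly as follows (again a sketch using Riak's bucket-properties names; with the default N=3, `quorum` resolves to 2, so PR + PW = 4 > N):

```json
{
  "props": {
    "pr": "quorum",
    "pw": "quorum",
    "dw": "quorum",
    "allow_mult": true,
    "notfound_ok": false,
    "basic_quorum": false
  }
}
```

Setting `allow_mult` to true hands any surviving conflicts to the application as siblings rather than silently discarding a value.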
Not so obvious
By retaining the vector clocks for deleted objects, we enhance the overall data integrity of the database over time and sidestep problems with resurrected deleted objects as we saw in the previous blog post.
This comes at a price: more disk space will be used to retain the old objects, and there will be many more tombstones for clients to recognize and ignore. (You have written your code to cope with tombstones and even tombstone siblings, haven’t you?)
I hope this series of blog posts has helped answer some of the mysteries, and will help you avoid some stumbling blocks, involved with running Riak.
If you occasionally wonder why relational databases have been around for so many decades and still have a hard time scaling horizontally, I suggest you think back to the tradeoffs and hard questions posed herein and consider this: all of this is to keep a simple key/value store running fast and answering questions as accurately as possible given the constraints imposed by inevitable hardware and network failure.
Imagine how much harder this is for a relational database with its dramatically more complex storage and retrieval model.
Distributed systems are not your father’s von Neumann machine.
If you’d like to simulate production load or experiment with various failure modes, here are tools which may be of assistance.
Basho Bench is a benchmarking tool that can generate repeatable performance tests.
Quick summary of configuration parameters
The effective default value is listed for each value. Most of these are also covered in our online documentation.
As mentioned in part 2, the capitalized parameters I’ve been using are for aesthetic purposes, and the real values (e.g.,
n_val in place of
N, and
dw in place of
DW) are shown below.
Reading and writing
- All of the remaining values under this heading can be no larger than N
- Value is in milliseconds, hence 3 seconds by default
- Alternative values are all, one, quorum, and default
- Brewer’s CAP theorem (Wikipedia)
- Brewer’s article on the CAP theorem, 12 years later
- Basho’s CRDT project
- Original CRDT research paper (PDF)
- Eventual consistency (Riak docs)
- Vector clocks (Riak docs)
- Dynamo paper with annotations for Riak’s architecture
- Overview of Riak’s concepts
- Riak Glossary
- Failure scenarios in Riak, from Ryan Zezeski’s Try-Try-Try project
May 9, 2013
This is the penultimate blog post in our look at Riak configuration parameters and associated behaviors, particularly the less obvious implications thereof.
Of success and failure
It is important to understand that failure can still result in success, success can result in failure, and, well, distributed systems are hard.
A successful failure
Imagine that a primary vnode is unavailable and a write request with
PW=3 is dispatched.
Even though the client will be informed of a write failure, the reachable primary vnodes received the request and will still write the new data. When the missing primary vnode returns to service, it will also receive the new data via read repair or active anti-entropy.
Thus, reads will see the “failed” write.
A failed success
As mentioned earlier, Riak will attempt to oblige a request even when vnodes which ordinarily would manage that data are unavailable. This leads to a situation called a sloppy quorum, in which other vnodes will handle reads or writes to comply with the specified R or W values.
This can lead to unexpected behavior.
Imagine this sequence of events:
- One of the primary vnodes for a given key becomes unavailable
- The key/value pair is copied to a secondary vnode
- The primary vnode comes back online
- A request arrives to delete the key; all primary vnodes acknowledge
- The tombstones (see below) are removed
- The same primary vnode fails again
- A request arrives for the deleted key
- Because the secondary vnode for that data doesn’t know about the
deletion, it replies with the old data
- Read repair causes the old data to be distributed to the primary vnodes
Voilà, deleted data is resurrected.
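The sequence above can be made concrete with a toy model. This is purely an illustration of the failure mode, not Riak's implementation: vnodes are plain dicts, and "read repair" is a crude stand-in that copies the first value found to every vnode consulted.

```python
# Toy model of the resurrection sequence above. A key absent from a dict
# means the vnode has no record at all (tombstones already reaped).

def read_with_repair(vnodes, key):
    """Return the first value found and copy it to the vnodes that are
    missing it -- a crude stand-in for read repair."""
    values = [v[key] for v in vnodes if key in v]
    if not values:
        return None
    winner = values[0]
    for v in vnodes:
        v[key] = winner          # read repair distributes the value
    return winner

primaries = [{}, {}, {}]
secondary = {}

# Steps 1-2: one primary fails; a secondary covers it (sloppy quorum)
primaries[0]["page"] = "300 likes"
primaries[1]["page"] = "300 likes"
secondary["page"] = "300 likes"   # covering for the failed primaries[2]

# Step 3: the failed primary recovers (handoff omitted for brevity)
primaries[2]["page"] = "300 likes"

# Steps 4-5: delete succeeds on all primaries; tombstones are reaped...
for v in primaries:
    del v["page"]
# ...but the secondary never hears about the deletion.

# Steps 6-8: a primary fails again; the read consults the stale secondary
result = read_with_repair([primaries[0], primaries[1], secondary], "page")
print(result)   # the deleted value is back, and read repair spreads it
```

Running this, the final read returns the long-deleted value and read repair copies it back onto the healthy primaries, exactly the resurrection described above.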
Tombstones, tombstones, and delete_mode
In addition to the “failed success” scenario above, it is possible to see deleted objects resurrected when synchronizing between multiple datacenters, especially when using older versions of Riak Enterprise and multi datacenter replication (MDC) in environments where connectivity between the datacenters can be spotty.
These cases of resurrected deleted data can be avoided by retaining the tombstones (and the all-important vector clocks) via the
delete_mode configuration parameter.
Deleting an object in a distributed data store is distinctly non-trivial, and in Riak it requires several steps. If you don’t need to delete objects, you should consider refraining from doing so.
Here is the sequence of events that take place when a key is deleted.
- A client requests that the object be deleted
- Note: all read/write quorum parameters (W, PW, etc.) must be met to allow a deletion request to succeed
- The existing vector clock is updated and stored with the tombstone
- X-Riak-Deleted=true metadata is added to the object for both internal record-keeping and external requests
- If delete_mode is set to keep, no further action is taken. The tombstone will remain in the database, although it cannot be retrieved with a simple GET operation
- If delete_mode is set to an integer value (in milliseconds), the backend will be instructed after that period of time to delete the object
- This is the standard path; the configuration value is 3000 (hence 3 seconds) by default
- If delete_mode is set to immediate or the time interval has passed, and all of these criteria are met, the backend is asked to delete the object
- No client has written to the same key in the interim
- All primary vnodes are available
- All primary vnodes report the same tombstone
- This is not the same as a Riak tombstone
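The branching in the list above can be summarized as a small decision function. This is a sketch of the logic as described, not Riak source code, and the returned strings are purely descriptive:

```python
def tombstone_action(delete_mode, interval_elapsed=False,
                     new_write_arrived=False, all_primaries_up=True,
                     tombstones_agree=True):
    """What happens to a Riak tombstone, per the sequence above (sketch)."""
    if delete_mode == "keep":
        return "tombstone retained forever"
    if isinstance(delete_mode, int) and not interval_elapsed:
        return f"backend delete scheduled in {delete_mode} ms"
    # delete_mode == "immediate", or the configured interval has passed:
    # the backend delete only proceeds if every criterion holds.
    if new_write_arrived or not all_primaries_up or not tombstones_agree:
        return "tombstone retained for now"
    return "backend asked to delete the object"

print(tombstone_action("keep"))
print(tombstone_action(3000))                        # the 3-second default path
print(tombstone_action(3000, interval_elapsed=True))
print(tombstone_action("immediate", all_primaries_up=False))
```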
Important: Riak tombstones will appear in MapReduce, Riak Pipe, and key list operations; even if you do not set delete_mode to keep, you should be aware of these occasional interlopers (check for the X-Riak-Deleted metadata in the object).
Caveat: There is currently a bug when requesting tombstoned objects via HTTP. The response will be a 404 status code with a vector clock header, but no
X-Riak-Deleted header. A patch is available but has not yet been applied.
Deleting and replacing
If you delete a key and wish to write to it again, it is best to retrieve any existing vector clock for that key to use for the new write, else you may end up with tombstone siblings (if
allow_mult=true) or even see tombstones replace the new value.
Since you may never be fully aware of what other clients are doing to your database, if you can afford the performance impact it is advisable to always request a key and attach the vector clock before writing data.
When using protocol buffers, make certain that
deletedvclock in your object request is set to
true in order to receive any tombstone vector clock.
As I’ve discussed, the act of deleting objects and their corresponding vector clocks leads to challenges with eventual consistency. Additionally, there are performance implications when reading non-existent keys, and corresponding configuration toggles to help manage the impact.
Waiting for every vnode with responsibility for a given key to respond with
notfound (thus indicating that the key does not exist on that vnode) can add undesirable latency. If your environment is optimized for fast reads over consistency by using
R=1, waiting for all 3 vnodes to reply is not what you signed up for.
The first toggle is an optimization Basho incorporated into early Riak and later converted to a configuration parameter named
basic_quorum. This setting has a very narrow scope: if it is set to true,
R=1 read requests will report a
notfound back to the client after only 2 vnodes reply with
notfound values instead of waiting for the 3rd vnode.
The default value is false. Later, the
notfound_ok configuration value was added. It has a much more profound impact on Riak’s behavior when keys are not present.
If notfound_ok=true (the default), then a
notfound response from a vnode is treated as a positive assertion that the key does not exist, rather than as a failure condition.
This means that when
notfound_ok=true (regardless of the
basic_quorum value) if the first vnode replies with a
notfound message, the client will also receive a notfound response.
If notfound_ok=false, then the coordinating node will wait for enough vnodes to reply with
notfound messages to know that it cannot fulfill the requested
R value. So, if
N=3 and R=2, then 2 negative responses are enough to report
notfound back to the client.
Note: This in no way impacts read repair. If it turns out that one of the other vnodes has a value for that key, read repair will handle the distribution of that data appropriately for future reads.
In the worst case, where only the last vnode to reply has a value for a given key, the table below indicates the number of consecutive vnode
notfound messages that will be returned before the coordinating node will reply with
notfound to the client.
Any cell with 3 indicates that the client will receive the value from that 3rd vnode; any other scenario results in a notfound response to the client.
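One way to reason about the worst case is the small function below. It encodes my reading of the rules described above (with N=3, basic_quorum's majority is 2) and is a sketch, not Riak source; `None` means the 3rd vnode's value reaches the client instead of a notfound.

```python
def worst_case_notfounds(n, r, notfound_ok, basic_quorum):
    """Worst-case number of consecutive vnode notfound replies before the
    coordinating node reports notfound to the client. Returns None when
    the last vnode's value would be returned instead."""
    quorum = n // 2 + 1
    if notfound_ok:
        # Each notfound counts as a positive assertion toward R.
        needed = r
    elif basic_quorum:
        # Short-circuit at a basic quorum of notfounds, or as soon as R
        # provably cannot be met, whichever comes first.
        needed = min(quorum, n - r + 1)
    else:
        # Wait until R provably cannot be satisfied.
        needed = n - r + 1
    # If all n replies are required, the nth vnode's value wins.
    return None if needed >= n else needed

# N=3, R=1 cases from the text:
print(worst_case_notfounds(3, 1, True, False))    # 1: first notfound is authoritative
print(worst_case_notfounds(3, 1, False, True))    # 2: basic_quorum short circuit
print(worst_case_notfounds(3, 1, False, False))   # None: 3rd vnode's value returned
```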
Broadly speaking, if you forget everything you’ve just read and trust Riak’s defaults, you should get the behavior you expect along with reasonable performance. With the introduction of active anti-entropy, there should not be many situations (other than during recovery from hardware/network failure) where multiple vnodes do not know about a valid key.
In our final post, I’ll take what we’ve learned and create configuration bundles to emphasize performance or data consistency.
I’ll also mention a couple of ways to perform your own stress tests to see how Riak behaves under normal (or abnormal) conditions.
May 8, 2013
This is part 2 of our 4-part series illustrating how Riak’s configuration options impact its core behaviors.
Part 1 covered background material and conflict resolution. In this post I’ll talk about reads and writes.
N, R, W, and hangers-on
All documentation lies by omission: there’s always something left out, deliberately or no.
These blog posts actively lie to you! There is no configuration parameter named
W or any of the other 1- or 2-character capitalized parameters you’ll see below.
Instead, the real parameters are
This is purely an aesthetic choice: I find the text easier to scan when the values stand out as capital letters. (And
n_val just drives me batty.)
For each of the parameters I’ll describe in this section, the maximum value is
N, the number of times each piece of data is replicated through the cluster.
N is specified globally, but can be redefined for each bucket.
The default value for
N in Riak is 3, so I’ll assume
N=3 for all of the examples below.
It is possible to use names instead of integers for the configuration parameters:
- all – All vnodes currently responsible for the replicated data must reply
- Equivalent to setting the value to match N
- one – Only 1 vnode need reply
- quorum – A majority (half plus one) of the replicas must respond
- default – Use the bucket (or global) default value
- In the absence of configuration changes, this will be the same as quorum
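The name-to-count translation above can be sketched in a few lines (an illustration of the rules as listed, not Riak source):

```python
def resolve_quorum_name(value, n, bucket_default="quorum"):
    """Translate a symbolic quorum name into a vnode count for a given
    n_val, per the naming rules listed above."""
    if value == "default":
        value = bucket_default    # fall back to the bucket/global default
    if value == "all":
        return n                  # every responsible vnode must reply
    if value == "one":
        return 1
    if value == "quorum":
        return n // 2 + 1         # majority: half, rounded down, plus one
    return int(value)             # already an integer

print(resolve_quorum_name("quorum", 3))   # 2
print(resolve_quorum_name("all", 3))      # 3
print(resolve_quorum_name("default", 3))  # 2 (quorum unless overridden)
```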
Readin’ and Writin’ (R and W)
After N, the two most commonly-discussed parameters in Riak overviews are R and W.
These are, simply enough, the number of vnodes which must respond to a read (
R) or write (
W) request before the request is considered successful and a response is sent to the requesting client.
The request will be sent to all
N vnodes that are known to be currently responsible for the data, and the coordinating node which is handling the request on behalf of the client will wait for all
N vnodes to reply, but the client will be informed of a result as soon as the specified R or W value has been met.
These choices have implications that may not be immediately obvious:
- Data accuracy is enhanced if the coordinating node waits for all N responses before replying – the last vnode to reply may have more recent data.
- Client responsiveness is degraded (possibly dramatically so in a situation where a node is failing) if the coordinating node waits for all N responses before replying.
- The client will receive a failure message if we ask for R=N (or
W=N) responses but one of the vnodes replies with an error.
The above facts lead to an unfortunate conclusion: at this point in time, there is no way to ask Riak for the best possible value in a single read request.
Primaries (PR and PW)
In the Riak model, there are
N vnodes with primary responsibility for any given key; these are termed the primary vnodes.
However, because Riak is optimized for availability, the database will use failover vnodes to comply with the R and
W parameters as necessitated by network or system failure.
By using the
PR (primary read) or
PW (primary write) configuration parameters with values greater than zero, Riak will only report success to the client if at least that number of primary vnodes reply with a successful outcome.
The downside is that requests are more likely to fail because the primary vnodes are unavailable.
As we’ll discuss in the next post, a failed
PW write request can (and typically will) still succeed (eventually); what matters to Riak when responding to the client is that it cannot assure the client of the write’s success unless the primary vnodes respond affirmatively.
As you can probably tell by the caveats and corner cases littering this document, Riak is cautious about making guarantees on data consistency.
However, as of Riak 1.3.1, it is safe to make one assertion: in the absence of other clients attempting to perform a write against the same key, if one client successfully executes a write request and then successfully executes a read request, and if for those requests
PW + PR > N, then the value retrieved will be the same as the value just written.
This is termed Read Your Own Writes consistency (RYOW).
Even this guarantee has a loophole. If all of the primary vnodes fail at the same time, the new values will no longer be available to clients; if they fail in such a way that the data has not yet been durably written to disk, the values could be lost forever.
Riak is not immune to truly catastrophic scenarios.
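The PR + PW > N rule above is just arithmetic, which makes it easy to check programmatically (a sketch of the rule as stated, with the loopholes noted above still applying):

```python
def ryow(pr, pw, n=3):
    """True if a successful write followed by a successful read is
    guaranteed to observe that write (absent concurrent writers and
    catastrophic primary failures), per the PR + PW > N rule above."""
    return pr + pw > n

print(ryow(pr=2, pw=2))  # True: quorum reads and writes overlap on a primary
print(ryow(pr=1, pw=1))  # False: the read may hit primaries the write missed
```

The intuition: with PR + PW > N, any PR primaries consulted on a read must overlap with at least one of the PW primaries that acknowledged the write.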
Durable writes (DW)
At the cost of adding latency to write requests, the
DW value can be set to a value between 1 and
N to require that the key and value be written to the storage backend on that number of nodes before the request is acknowledged.
W and DW, oh my
By default, DW is set to quorum, and will be treated as 1 internally even if configured to be 0.
In versions of Riak prior to 1.3.1, if W is set to a value less than DW, the DW value would be implicitly demoted to W. This means that setting W=1 would result in DW=1 as well, which is a reasonable performance optimization and likely what a user would expect.
However, if the request indicated DW=3 while W=2 (the default), this optimization is much less desirable and not at all expected behavior.
In order to make the overall behavior much more explicit with v1.3.1, the effective DW value is no longer demoted to match W, so any requests using W=1 to shorten the request time should also explicitly set DW=1, or the performance will suffer significantly.
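The before-and-after behavior can be summarized in a small sketch (my reading of the rules described above, not Riak source):

```python
def effective_dw(w, dw, pre_1_3_1):
    """Effective DW for a request, per the W/DW interaction above.
    A DW of 0 is treated as 1 internally in either era."""
    dw = max(dw, 1)
    if pre_1_3_1:
        # Older releases silently demoted DW to match a smaller W.
        dw = min(dw, w)
    return dw

print(effective_dw(w=1, dw=3, pre_1_3_1=True))   # 1: DW demoted to match W
print(effective_dw(w=1, dw=3, pre_1_3_1=False))  # 3: DW now honored as given
print(effective_dw(w=2, dw=0, pre_1_3_1=False))  # 1: zero bumped to one
```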
Delete quorum (RW)
Deleting a key requires the successful read of its value and successful write of a tombstone (we’ll talk a lot about tombstones in our next blog post), complying with all R, W, PR, and
PW parameters along the way.
If somehow both R and
W are undefined (which is probably not possible in recent releases of Riak), the
RW parameter will substitute for read and write values during object deletes. If you want to test someone on obscure Riak facts, ask about RW.
(It should be no surprise to learn this parameter should already have been, and soon will be, deprecated.)
To recap much of what we’ve covered to this point, I’ll walk through some typical reads and writes and the handling of possible inconsistencies.
For these scenarios, all 3 vnodes responsible for the key are available. The same behavior should be exhibited if a vnode is down,
but the response may be delayed as the cluster detects the failure and sends the request to another vnode.
Assume R=2 and W=2 for each of these.
A normal read request (to retrieve the value associated with a key) is sent to all 3 primary vnodes.
| Scenario | Result |
| --- | --- |
| All 3 vnodes agree on the value | Once the first 2 vnodes return the value, it is returned to the client |
| 2 of 3 vnodes agree on the value, and those 2 are the first to reach the coordinating node | The value is returned to the client. Read repair will deal with the conflict per the later scenarios, which means that a future read may return a different value or siblings |
| 2 conflicting values reach the coordinating node and vector clocks allow for resolution | The vector clocks are used to resolve the conflict and return a single value, which is propagated via read repair to the relevant vnodes |
| 2 conflicting values reach the coordinating node, vector clocks indicate a fork in the object history, and allow_mult=false | The object with the most recent timestamp is returned and propagated via read repair to the relevant vnodes |
| 2 siblings or conflicting values reach the coordinating node, vector clocks indicate a fork in the object history, and allow_mult=true | All keys are returned as siblings, optionally with associated values |
Now, a write request.
| Scenario | Result |
| --- | --- |
| A vector clock is included with the write request, and is newer than the vclock attached to the existing object | The new value is written and success is indicated as soon as 2 vnodes acknowledge the write |
| A vector clock is included with the write request, but conflicts with the vclock attached to the existing object, and allow_mult=true | The new value is created as a sibling for future reads |
| A vector clock is included with the write request, but conflicts with (or is older than) the vclock attached to the existing object, and allow_mult=false | The new value overwrites the old |
| A vector clock is not included with the write request, an object already exists, and allow_mult=true | The new value is created as a sibling for future reads |
| A vector clock is not included with the write request, an object already exists, and allow_mult=false | The new value overwrites the existing value |
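The write scenarios above hinge on one question: does the incoming vector clock descend from (include all the history of) the stored one? A minimal sketch, modeling vclocks as dicts of actor-to-counter (an illustration of the decision logic, not Riak's implementation):

```python
def descends(a, b):
    """True if vclock a includes all the history of vclock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def on_write(stored, stored_vc, incoming, incoming_vc, allow_mult):
    """Outcome of a write against an existing object, per the table above."""
    if incoming_vc is not None and descends(incoming_vc, stored_vc):
        return [incoming]              # strictly newer: clean overwrite
    if allow_mult:
        return [stored, incoming]      # conflict (or no vclock): keep siblings
    return [incoming]                  # last write wins

old_vc  = {"vnode_a": 2}
new_vc  = {"vnode_a": 2, "vnode_b": 1}   # descends from old_vc
fork_vc = {"vnode_c": 1}                 # concurrent fork in the history

print(on_write("old", old_vc, "new", new_vc, allow_mult=True))   # ['new']
print(on_write("old", old_vc, "new", fork_vc, allow_mult=True))  # ['old', 'new']
print(on_write("old", old_vc, "new", fork_vc, allow_mult=False)) # ['new']
```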
In our next post, I’ll discuss scenarios where Riak’s eventual consistency can cause unexpected behavior. I’ll also tackle object deletions, a surprisingly complex topic.
May 7, 2013
This is part 1 of a 4-part series on subtleties and tradeoffs to consider when configuring write and read parameters for a Riak cluster.
The full implications of the configuration options discussed here are rarely obvious and often are revealed only under a production load, hence this series.
More generally, I hope these documents serve to help illuminate some of the complexities involved when creating distributed systems. Data consistency on a single computer is (usually) straightforward; it is a different story altogether when 5, 10, or 100 servers share that responsibility.
What you should know first
This series is intended as a reasonably deep dive into the behavioral characteristics of Riak, and thus assumes that the reader has at least a passing familiarity with these key Riak concepts:
If you’re not comfortable with those topics, the Riak Fast Track is highly recommended reading, and if you encounter vocabulary or concepts that are particularly challenging, the following links should be helpful:
I’ll cover a few key concepts as an introduction/refresher.
Consistency, eventual or otherwise
As Eric Brewer’s CAP theorem established, distributed systems have to make hard choices. Network partition is
inevitable. Hardware failure is inevitable. When a partition occurs, a well-behaved system must choose its behavior from a spectrum of options ranging from “stop accepting any writes until the outage is resolved” (thus maintaining absolute consistency) to “allow any writes and worry about consistency later” (to maximize availability).
Riak is designed to favor availability, but allows read and write requests to be tuned on the fly to sacrifice availability for increased consistency, depending on the business needs of the data.
The concept of eventual consistency is the topic of many academic papers and conference talks, and is a vital part of the Riak story. See our page on eventual consistency for more information.
Because Riak operates on the assumption that networks and systems fail, it has automated processes to clean up inconsistent data after such an event.
The key data structure to do so with Riak today is the vector clock, which I’ll describe shortly.
Historically Riak has relied on read repair, a passive anti-entropy mechanism for performing cleanup whenever a key is requested. It assembles the best answer for any given read request and makes certain that value is shared among the vnodes which should have it.
With the 1.3 release, Basho has added a new active anti-entropy (AAE) feature to handle such repair
activities independently of read requests, thus reducing the odds of outdated values being reported to clients.
Vector clocks are critical pieces of metadata that help Riak identify temporal relationships between changes. In a distributed system it is neither possible nor necessarily useful to establish an absolute ordering of events.
One behavioral toggle with a broad impact that will not be evaluated in these documents is
vnode_vclocks, which determines whether vector clocks are tied to client identifiers (Riak’s behavior prior to 1.0) or virtual node counters (standard behavior from 1.0 onward).
Setting it to true (the now-standard behavior, which we’ll assume throughout this series) has a slight negative impact on performance but helps keep the number of siblings under control.
Siblings are Riak’s way of ducking responsibility for making decisions about conflicting data when there’s no obvious way to judge which value is “correct” based on the history of writes, and when Riak is not configured to simply choose the last written value.
Keep in mind that as far as Riak’s key/value store is concerned, the values are opaque. If your application can compare two values and find a way to merge them, you are encouraged to incorporate that logic, but Riak will not do that for you.
Sibling management adds overhead both to the client and the database, but if you want to always have the best data available, that overhead is unavoidable.
(Unavoidable, that is, until Convergent/Commutative Replicated Data Types (CRDTs) are available for production use. See Basho’s
riak_dt project for more information.)
Riak currently has several layers of configuration:
- Defaults embedded in the source code
- Environment variables
- Configuration files
- Client software (by manipulating bucket properties)
- Capability system
Of particular interest is the Riak capability system, which is complex and hasn’t been well-communicated.
In short, the capability system is designed to help make upgrades run more smoothly by allowing nodes to negotiate when it is appropriate to start using a new or changed feature.
For example, the
vnode_vclocks behavior is preferred by Riak nodes, and unless explicitly configured otherwise through overrides, a cluster being upgraded from a version prior to 1.2 will negotiate that value to
true once the rolling upgrade has completed.
See the release notes for Riak 1.2.0 for more details on capabilities.
Basho is interested in improving the documentation for our configuration systems, the process of setting configuration values, and the transparency of the current values in a cluster.
All of the configuration items we discuss in this series, with the exception of
vnode_vclocks, can be redefined at the bucket level for more granular control over desired behaviors and performance characteristics.
We’ve covered a lot of useful background material; now let’s dive in, tackling two configuration parameters that directly impact how Riak handles conflict resolution and eventual consistency.
allow_mult, when set to
true, specifies that siblings should be stored whenever there is conflicting vector clock information.
last_write_wins, when set to
true, leads to code shortcuts such that vector clocks will be ignored when writing values, thus the vector clock only reflects what the client supplied, not what was already in the system.
The default behavior (with both values
false), is that new data is always written. Vector clocks are constructed from the new and any old objects, and siblings are not created.
Setting both values to
true is an unsupported configuration; currently it effectively means that
allow_mult is treated as false.
Read repair follows the same logic as a client put request, so these values impact its behavior as well.
There is no way to reject client writes that include outdated vector clocks, so either make certain your clients are well-behaved, or better yet set
allow_mult=true and deal with conflicts in your application.
One final warning: because Riak considers values to be opaque, siblings can be identical. If the value and application-provided metadata are identical, siblings will still be created (assuming
allow_mult=true) if the vector clocks do not allow for resolution.
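To make that concrete: siblings are kept per conflicting vector clock, not per distinct value, so two writes of the same bytes under concurrent vclocks still produce two siblings. A toy model (not Riak internals):

```python
def descends(a, b):
    """True if vclock a includes all the history of vclock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def store(siblings, value, vclock):
    """Keep the new (value, vclock) pair; drop any sibling it supersedes."""
    survivors = [(v, vc) for v, vc in siblings if not descends(vclock, vc)]
    survivors.append((value, vclock))
    return survivors

siblings = store([], "300 likes", {"client_a": 1})
siblings = store(siblings, "300 likes", {"client_b": 1})  # same value!

print(len(siblings))  # 2 identical siblings, because the vclocks conflict
```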
In our next post, I’ll cover most of the configuration parameters that govern Riak’s key (pun intended) behaviors regarding reads and writes.