Tag Archives: vector clocks

Clocks Are Bad, Or, Welcome to the Wonderful World of Distributed Systems

November 12, 2013

A recent email thread on the Riak users mailing list highlighted one of the key weaknesses of distributed systems: clock consistency.

The first email:

Occasionally, riak seems to not store an object I try to save. I have run tcpdump on the node receiving the request to ensure it is receiving the http packets with the correct JSON from the client. When the issue occurs the node is in fact receiving the request with the correct JSON.

Riak is designed to accommodate server and network failures without ever losing committed writes, so this led to a quick response from Basho’s engineers.

After some discussion, a vital piece of information was revealed:

One other thing that might be worth mentioning here is the writes I’m mentioning are actually updates to existing objects. The object exists, an attempt to write an update for the object appears to be received by a node, but the object maintains it’s original value.

Riak was dropping updates rather than writes, which is a horse of a different color. To see why updates are much more problematic for any distributed database, read on.

Concurrent Updates

In a database that runs on a single server, setting aside any complications introduced by transactions or locks, the second of two updates to the same record will overwrite the first. Last write wins.

With Riak’s simplest conflict resolution behavior, the second of two updates to the same object may or may not overwrite the first, even if those two updates are spaced far apart. Last write wins, except when it doesn’t, but even then it does.

Confused yet?

The problem is simple: there is no reliable definition of “last write”; because system clocks across multiple servers are going to drift.

On a single server, there’s one canonical clock, regardless of accuracy. The system can always tell which write occurred in which order (assuming that the clock is always increasing; setting a clock backwards can cause all sorts of bad behavior).

So, back to our original problem with lost updates:

The nodes were a bit out of synch (up to 30 seconds… looking into why ntp wasn’t working!). So far it appears this was the issue.

If two updates to the same object occur within 30 seconds in such an environment, the end result is unpredictable.

Taming the Beast

The conclusion drawn from the discussion was to implement (and, hopefully, to monitor) time synchronization. This is a step in the right direction, and one that every distributed system should implement, but there are more powerful and instructive lessons to impart.

Background Reading

Some of this discussion requires awareness of siblings, vector clocks, and related arcana. If you wish to read more about these topics, Basho’s earlier blog post Understanding Riak’s Configurable Behaviors: Part 1 provides sufficient context. (You can find links in the epilogue to the full series, but part 1 covers the necessary background for this post.)

If instead you decide you’d like to avoid reading about and dealing with such complexities entirely, skip over the Nitty Gritty section to The Land of Milk and Honey.

Nitty Gritty

Vector Clocks

One approach that should generally be employed when writing Riak applications is to supply vector clocks with each update. It’s not clear in this particular scenario that it would have helped, but it certainly can’t hurt. Giving Riak more information to track causal history is never a bad thing.

See our documentation on vector clocks for more information. And although the details are a bit dated, our blog post Why Vector Clocks are Easy makes for a nice overview of the concept.

Forcing the Last Write to Win

A rather non-obvious approach is to take the default last write wins conflict resolution one step further.

As discussed in part 1 of the configurable behaviors blog series, there are two closely-related configuration parameters that determine how Riak approaches conflict resolution: allow_mult and last_write_wins. The former indicates whether Riak should keep all conflicts for the client to resolve; the latter is our concern at the moment.

If allow_mult is set to false, setting last_write_wins to true will instruct Riak to always overwrite existing objects, ignoring the timestamps stored with them.

So, nominally, this achieves what we earlier implied to be impossible: the last write truly does win, regardless of clock consistency.

The problem is that we’ve just punted the problem down the road a bit. Yes, all servers that receive an object will blindly write it, but any servers that don’t receive it due to network partition or server failure will still retain an older value, and depending on clock consistency the older value may still win once the network or server failure is corrected.

Broadly speaking, if you’re going to have data consistency problems, it’s best for that to be obvious and easily detectable during testing stages. This “solution'; would have made the situation much harder to recognize before production.

Stopping Last Write Wins

At least in part to limit the complexity of developing applications, Basho decided to specify Riak’s default configuration as allow_mult=false, which requires the database to resolve conflicting writes internally.

As we’ve seen, Riak isn’t exactly a genius at resolving conflicting writes. Beyond the challenges of clock consistency, Riak treats objects as opaque and has no awareness of business logic.

It’s almost always better to bite the bullet: instruct Riak to retain all conflicting updates as siblings (via allow_mult=true) and write your application to deal with them appropriately.

Note: We are planning to change the default setting for allow_mult to true in Riak 2.0, but please check the documentation and your configuration before assuming either behavior.

The Land of Milk and Honey

Distributed data types

Creating data types that can survive network partitions and self-heal has long been a goal for our engineers. With Riak 1.4, Basho introduced distributed counters; with 2.0, Riak will have a larger suite of distributed data types that can resolve conflicts internally, notably including sets and maps.

Although 2.0 is not yet released, a technical preview is available.

It is also possible to define such Riak Data Types (known formally as CRDTs) at the application layer. See the two-part blog series Index for Fun and for Profit and Indexing the Zombie Apocalypse With Riak for more information.

Strong Consistency

Also with 2.0, Riak will include the option of designating certain data as strongly consistent, meaning that the servers that hold a piece of data will have to agree on any updates to that data.

As appealing as that may sound, it is impossible to guarantee strong consistency without introducing coordination overhead and constraining Riak’s ability to continue to allow for requests when servers or networks have failed.

And aren’t low latency and high availability the reasons you’re using Riak?

The Silver(*) Bullet: Immutability

(* or at least stainless steel)

The rise of “big data” is linked to a resurgence of interest in functional programming, which is particularly well-suited for processing large data sets. (See Dean Wampler’s Lambda Jam talk Copious Data for an interesting exposition of this idea.)

One of the key tenets of functional programming is that data is immutable, meaning that destructive updates are not (typically) allowed.

The relational data model does not offer much (any?) support for immutable data, but it is a powerful concept. At Basho’s inaugural RICON conference Pat Helland gave a talk entitled Immutability Changes Everything which goes into more detail.

While it isn’t necessarily true that immutability solves everything with distributed systems, it’s a great start. Without data updates, there are no conflicts.

See the configurable behaviors epilogue (specifically, the discussion of Datomic) for a discussion of configuration tweaks to Riak to take better advantage of immutable data for low latency.


If your distributed system isn’t explicitly dealing with data conflicts, any correct behavior it exhibits is more a matter of good luck than of good design.

If your distributed database relies on clocks to pick a winner, you’d better have rock-solid time synchronization, and even then, it’s unlikely your business needs are served well by blindly selecting the last write that happens to arrive.

Riak provides powerful tools for helping address the inherent challenges of distributed data, but they have to be used to be useful.

John Daily

Think Distributed: Episode 2

September 11, 2013

The second episode of Think Distributed, a podcast focused on distributed systems, is now available. This episode discusses causality, vector clocks, version vectors, and CRDTs.

This episode’s panelists are:

In addition to participating on this panel, Peter Bailis will be speaking at RICON West, Basho’s distributed systems conference. His talk, “Bad As I Wanna Be: Coordination and Consistency in Distributed Databases” will discuss how to reason about the trade-offs between coordination, consistency, latency, and availability, with a focus on practical takeaways from recent research both at Berkeley and beyond. RICON West will take place in San Francisco from October 29-30th. Tickets are still available here: ricon-west-2013.eventbrite.com/

You can listen to both this Causality episode and the Consensus episode at thinkdistributed.io


Why Vector Clocks Are Hard

April 5, 2010

A couple of months ago, Bryan wrote about vector clocks on this blog. The title of the post was “Why Vector Clocks are Easy”; anyone who read the post would realize that he meant that they’re easy for a client to use when talking to a system that implements them. For that reason, there is no reason to fear or avoid using a service that exposes the existence of vector clocks in its API.

Of course, actually implementing such a system is not easy. Two of the hardest things are deciding what  an actor is (i.e. where the incrementing and resolution is, and what parties get their own field in the vector) and how to keep vclocks from growing without bound over time.

In Bryan’s example the parties that actually proposed changes (“clients”) were the actors in the vector clocks. This is the model that vector clocks are designed for and work well with, but it has a drawback. The width of the vectors will grow proportionally with the number of clients. In a group of friends deciding when to have dinner this isn’t a problem, but in a distributed storage system the number of clients over time can be large. Once the vector clocks get that
large, they not only take up more space in disk and RAM but also take longer to compute comparisons over.

Let’s run through that same example again, but this time visualize the vector clocks throughout the scenario. If you don’t recall the whole story in the example, you should read Bryan’s post again as I am just going to show the data flow aspect of it here.

Vector Clocks by Example, in detail

Start with Alice’s initial message where she suggests Wednesday. (In the diagrams I abbreviate names, so that “Alice” will be “A” in the vclocks and so on for Ben, Cathy, and Dave.)

date = Wednesday
vclock = Alice:1

Ben suggests Tuesday:

date = Tuesday
vclock = Alice:1, Ben:1

Dave replies, confirming Tuesday:

date = Tuesday
vclock = Alice:1, Ben:1, Dave:1

Now Cathy gets into the act, suggesting Thursday:

date = Thursday
vclock = Alice:1, Cathy:1

Dave has two conflicting objects:

date = Tuesday
vclock = Alice:1, Ben:1, Dave:1


date = Thursday
vclock = Alice:1, Cathy:1

Dave can tell that these versions are in conflict, because neither vclock “descends” from the other. Luckily, Dave’s a reasonable guy, and chooses Thursday. Dave also created a vector clock that is a successor to all previously-seen vector clocks. He emails this value back to Cathy.

date = Thursday
vclock = Alice:1, Ben:1, Cathy:1, Dave:2

So now when Alice asks Ben and Cathy for the latest decision, the replies she receives are, from Ben:

date = Tuesday
vclock = Alice:1, Ben:1, Dave:1

and from Cathy:

date = Thursday
vclock = Alice:1, Ben:1, Cathy:1, Dave:2

From this, she can tell that Dave intended his correspondence with Cathy to override the decision he made with Ben. All Alice has to do is show Ben the vector clock from Cathy’s message, and Ben will know that he has been overruled.

That worked out pretty well.

Making it Easier Makes it Harder

Notice that even in this short and simple example the vector clock grew from nothing up to a 4-pairs mapping? In a real world scenario with long-lived data, each data element would end up with a vector clock with a length proportional to the number of clients that had ever modified it. That’s a (potentially unbounded) large growth in storage volume and computation, so it’s a good idea to think about how to prevent it.

One straightforward idea is to make the servers handling client requests be the “actors”, instead of representing the clients directly. Since any given system usually has a known bounded number of servers over time and also usually has less servers than clients, this serves to reduce and cap the size of the vclocks. I know of at least two real systems that have tried this. In addition to keeping growth under control, this approach attracts people because it means you don’t expose “hard” things like vector clocks to clients at all.

Let’s think through the same example, but with that difference, to see how it goes. We’ll assume that a 2-server distributed system is coordinating the communication, with clients evenly distributed among them. We’ll be easy on ourselves and allow for client affinity, so for the duration of the session each client will use only one server. Alice and Dave happen to get server X, and Ben and Cathy get server Y. To avoid getting too complicated here I am not going to draw the server communication; instead I’ll just abstract over it by changing the vector clocks accordingly.

We’re fine through the first few steps:

The only real difference so far is that each update increments a vector clock field named for the client’s chosen server instead of the client itself. This will mean that the number of fields needed won’t grow without bound; it will be the same as the number of servers. This is the desired effect of the change.

We run into trouble, though, when Cathy sends her update:

In the original example, this is where a conflict was created. Dave sorted out the conflict, and everything was fine. With our new strategy, though, something else happened. Ben and Cathy were both modifying from the same original object. Since we used their server id instead of their own name to identify the change, Cathy’s message has the same vector clock as Ben’s! This means that Dave’s message (responding to Ben) appears to be a simple successor to Cathy’s… and we lose her data silently!

Clearly, this approach won’t work. Remember the two systems I mentioned that tried this approach. Neither of them stuck with it once they discovered that it can be expected to silently lose updates.

For vector clocks to have their desired effect without causing accidents such as this, the elements represented by the fields in the vclock must be the real units of concurrency. In a case like this little example or a distributed storage system, that means client identifiers, not server-based ones.

Just Lose a Little Information and Everything Will Be Fine

If we use client identifiers, we’re back in the situation where vector clocks will grow and grow as more clients use a system over time. The solution most people end up with is to “prune” their vector clocks as they grow.

This is done by adding a timestamp to each field, and updating it to the current local time whenever that field is incremented. This timestamp is never used for vclock comparison — that is purely a matter of logical time — but is only for pruning purposes.

This way, when a given vclock gets too big, you can remove fields, starting at the one that was updated longest ago, until you hit a size/age threshold that makes sense for your application.

But, you ask, doesn’t this lose information?

Yes, it does — but it won’t make you lose your data. The only case where this kind of pruning will matter at all is when a client holds a very old copy of the unpruned vclock and submits data descended from that. This will create a sibling (conflict) even though you might have been able to resolve it automatically if you had the complete unpruned vclock at the server. That is the tradeoff with pruning: in exchange for keeping growth under control, you run the chance of occasionally having to do a “false merge”… but you never lose data quietly, which makes this approach unequivocally better than moving the field identifiers off of the real client and onto the server.


So, vclocks are hard: even with perfect implementation you can’t have perfect information about causality in an open system without unbounded information growth. Realize this and design accordingly.

Of course, that is just advice for people building brand new distributed systems or trying to improve existing ones. Using a system that exposes vclocks is still easy.


Why Vector Clocks are Easy

January 29, 2010

Vector clocks are confusing the first time you’re introduced to them. It’s not clear what their benefits are, nor how it is you derive said benefits. Indeed, each Riak developer has had his own set of false starts in making them behave.

The truth, though, is that vector clocks are actually very simple, and a couple of quick rules will get you all the power you need to use them effectively.

The simple rule is: assign each of your actors an ID, then make sure you include that ID and the last vector clock you saw for a given value whenever to store a modification.

The rest of this post will explain why and how to follow that simple rule. First, I’ll explain how vector clocks work with a very simple example, and then show how to use them easily in Riak.

Vector Clocks by Example

We’ve all had this problem:

Alice, Ben, Cathy, and Dave are planning to meet next week for
dinner. The planning starts with Alice suggesting they meet on
Wednesday. Later, Dave discuss alternatives with Cathy, and they
decide on Thursday instead. Dave also exchanges email with Ben, and
they decide on Tuesday. When Alice pings everyone again to find out
whether they still agree with her Wednesday suggestion, she gets
mixed messages: Cathy claims to have settled on Thursday with Dave,
and Ben claims to have settled on Tuesday with Dave. Dave can’t be
reached, and so no one is able to determine the order in which these
communications happened, and so none of Alice, Ben, and Cathy know
whether Tuesday or Thursday is the correct choice.

The story changes, but the end result is always the same: you ask two people for the latest version of a piece of information, and they reply with two different answers, and there’s no way to tell which one is really the most recent.

Vector clocks to the rescue, but how? Simple: tag the date choice with a vector clock, and then have each party member update the clock whenever they alter the choice. Start with Alice’s initial message:

date = Wednesday
vclock = Alice:1

Alice says, “Let’s meet Wednesday,” and tags that value as the first version of the message that she has seen. Now Dave and Ben start talking. Ben suggests Tuesday:

date = Tuesday
vclock = Alice:1, Ben:1

Ben left Alice’s mark alone, but added a mark specifying that it was the first version of the message that he had seen. Dave replies, confirming Tuesday:

date = Tuesday
vclock = Alice:1, Ben:1, Dave:1

Just like Ben’s modification, Dave just adds his own first-revision mark. Now Cathy gets into the act, suggesting Thursday:

date = Thursday
vclock = Alice:1, Cathy:1

But wait, what happened to Ben’s and Dave’s marks? Cathy didn’t have a version of the object that had been modified by Ben or Dave, so their marks can’t appear in her modification. This means that Dave has two conflicting objects:

date = Tuesday
vclock = Alice:1, Ben:1, Dave:1


date = Thursday
vclock = Alice:1, Cathy:1

Dave can tell that these versions are in conflict, because neither vclock “descends” from the other. In order for vclock B to be considered a descendant of vclock A, each marker in vclock A must have a corresponding marker in B that has a revision number greater than or equal to the marker in vclock A. Markers not contained in a vclock can be considered to have revision number zero. So, since the Tuesday value has a Cathy revision of zero while Thursday has a Cathy revision of one, Tuesday cannot descend from Thursday. But, since Thursday has Ben and Dave revisions of zero while Tuesday has Bend and Dave revisions of one, Thursday is also not descended from Tuesday. Neither succeeds the other, so Dave has a conflict to sort out.

Luckily, Dave’s a reasonable guy, and chooses Thursday:

date = Thursday
vclock = Alice:1, Ben:1, Cathy:1, Dave:2

Dave also created a vector clock that is successor to all previously-seen vector clocks: it has revision numbers for every actor equal to or greater than the last revision number he saw for that actor. He emails this value back to Cathy.

So now when Alice asks Ben and Cathy for the latest decision, the replies she receive are, from Ben:

date = Tuesday
vclock = Alice:1, Ben:1, Dave:1

and from Cathy:

date = Thursday
vclock = Alice:1, Ben:1, Cathy:1, Dave:2

From this, she can tell that Dave intended his correspondence with Cathy to override the decision he made with Ben. All Alice has to do is show Ben the vector clock from Cathy’s message, and Ben will know that he has been overruled. (Dave will, almost certainly, blame his broken email software for failing to inform Ben of the change.)

How to do this in Riak

Now that you understand vector clocks, using them with Riak is easy. I’ll use the raw HTTP interface to illustrate.

First, whenever you store a value, include an X-Riak-ClientId header to identify your actor. For Alice’s first message above, you’d say:

curl -X PUT -H "X-Riak-ClientId: Alice" -H "content-type: text/plain" 
http://localhost:8098/raw/plans/dinner --data "Wednesday"

When Ben, Cathy, and Dave each GET Alice’s plans, they’ll get the same vector clock (I’ve removed some of the other headers for brevity):

curl -i http://localhost:8098/raw/plans/dinner
HTTP/1.1 200 OK
X-Riak-Vclock: a85hYGBgzGDKBVIsrLnh3BlMiYx5rAzLJpw7wpcFAA==
Content-Type: text/plain
Content-Length: 9


The X-Riak-Vclock header contains an encoded version of a vclock that is the same as out earlier example: Alice has modified this value once.

Now when Ben sends his change to Dave, he includes both the vector clock he pulled down (in the X-Riak-Vclock header), and his own X-Riak-Client-Id:

curl -X PUT -H "X-Riak-ClientId: Ben" -H "content-type: text/plain" 
-H "X-Riak-Vclock: a85hYGBgzGDKBVIsrLnh3BlMiYx5rAzLJpw7wpcFAA==" 
http://localhost:8098/raw/plans/dinner --data "Tuesday"

Dave pulls down a fresh copy, and then confirms Tuesday:

curl -i http://localhost:8098/raw/plans/dinner
X-Riak-Vclock: a85hYGBgymDKBVIsrLnh3BlMiYx5rAymfeeO8EGFWRLl30GF/00ACmcBAA==
curl -X PUT -H "X-Riak-ClientId: Dave" -H "content-type: text/plain" 
-H "X-Riak-Vclock: a85hYGBgymDKBVIsrLnh3BlMiYx5rAymfeeO8EGFWRLl30GF/00ACmcBAA==" 
http://localhost:8098/raw/plans/dinner --data "Tuesday"

Cathy, on the other hand, hasn’t pulled down a new version, and instead merely updated the plans with her suggestion of Thursday:

curl -X PUT -H "X-Riak-ClientId: Cathy" -H "content-type: text/plain" 
-H "X-Riak-Vclock: a85hYGBgzGDKBVIsrLnh3BlMiYx5rAzLJpw7wpcFAA==" 
http://localhost:8098/raw/plans/dinner --data "Thursday"

(That’s the same vector clock that Ben used, in that encoded gibberish is making your eyes cross.)

Now, when Dave goes to grab this new copy (after Cathy tells him she has posted it), he’ll see one of two things. If the “plans” Riak bucket has the allow_mult property set to false, he’ll see just Cathy’s update. If allow_mult is true for the “plans” bucket, he’ll see both his last update and Cathy’s. I’m going to show the allow_mult=true version below, because I think it illustrates the flow better.

curl -i -H "Accept: multipart/mixed" http://localhost:8098/raw/plans/dinner
HTTP/1.1 300 Multiple Choices
X-Riak-Vclock: a85hYGBgzWDKBVIsrLnh3BlMiYx5rAymfeeO8EGFWRLl30GF1fsRwsypF59BhT0mIoTZ/1SYQIUrEcJszUksu9R6kCWyAA==
Content-Type: multipart/mixed; boundary=ZZ3eyjUllBi7GXRRMJsUublFxjn
Content-Length: 368

Content-Type: text/plain

Content-Type: text/plain


Dave sees two values because the vclock that Cathy generated wasn’t a successor to the vclock that Dave had generated with his last modification. Riak couldn’t choose between them, and therefore kept both values.

Dave picks Thursday, and updates the object, resolving the conflict. Riak has already computed a unified, descendant vector clock for Dave, so he uses the vector clock from the multi-value version he just pulled down, just like before:

curl -X PUT -H "X-Riak-ClientId: Dave" -H "content-type: text/plain" 
-H "X-Riak-Vclock: a85hYGBgzWDKBVIsrLnh3BlMiYx5rAymfeeO8EGFWRLl30GF1fsRwsypF59BhT0mIoTZ/1SYQIUrEcJszUksu9R6kCWyAA==" 
http://localhost:8098/raw/plans/dinner --data "Thursday"

Now when Alice check for the latest version, she just sees the final decision:

curl -i http://localhost:8098/raw/plans/dinner
HTTP/1.1 200 OK
X-Riak-Vclock: a85hYGBgzWDKBVIsrLnh3BlMiYx5rAymfeeO8EGFWRLl30GF1fvhwmzNSSy71HqgEpUTEerZ/1SYYBFmTr34DCjMBBTOnQwUzgIA
Content-Type: text/plain
Content-Length: 7


While Riak couldn’t decide whether to choose Cathy’s modification over Dave’s earlier modification, it was easy to choose Dave’s latest modification, because the vclock created was a successor to the vclock in place.


So, vclocks are easy: assign each of your actors an ID (“Alice”, “Ben”, “Cathy”, and “Dave” in these examples), then make sure you include that ID and the last vector clock you saw for a given value whenever to store a modification.

If two actors store changes with vector clocks that don’t descend from each other, Riak will store and hand back both values. When descendancy can be calculated, values stored with vector clocks that have been succeeded will be removed.