Tag Archives: Riak

Toward A Consistent, Fact-based Comparison of Emerging Database Technologies

A Muddle That Slows Adoption

Basho released Riak as an open source project seven months ago and began commercial service shortly thereafter. As new entrants into the loose collection of database projects, we observed two things:

  1. Widespread Confusion — the NoSQL rubric, and the decisions of most projects to self-identify under it, have created a false perception of overlap and similarity among projects that differ not just technically but in their approaches to licensing and distribution, leading to…
  2. Needless Competition — driving the confusion, many projects (ours included, for sure) competed passionately (even acrimoniously) for attention as putative leaders of NoSQL, a fool’s errand as it turns out. One might as well claim leadership of all tools called wrenches.

The optimal use cases, architectures, and methods of software development differ so starkly even among superficially similar projects that to compete is to demonstrate a (likely pathological) lack of both user knowledge and self-knowledge.

This confusion and wasted energy — in other words, the market inefficiency — has been the fault of anyone who has laid claim to, or professed designs on, the NoSQL crown. The costs are clear:

  1. Adoption suffers — Users either make decisions based on muddled information or, worse, do not make any decision whatsoever.
  2. Energy is wasted — At Basho we spent too much time from September to December answering the question posed without fail by investors and prospective users and clients: “Why will you ‘win’ NoSQL?”

With the vigor of fools, we answered this question, even though we rarely if ever encountered another project’s software in a “head-to-head” competition. (In fact, in the few cases where we have been pitted head-to-head against another project, we have won or lost so quickly that we cannot help but conclude the evaluation could have been avoided altogether.)

The investors and users merely behaved as rational (though often bemused) market actors. Having accepted the general claim that NoSQL was a monolithic category, both sought to make a bet.

Clearly what is needed is objective information presented in an environment of cooperation driven by mutual self-interest.

This information, shaped not by any one person’s necessarily imperfect understanding of the growing collection of data storage projects but rather by all the participants themselves, would go a long way to remedying the inefficiencies discussed above.

Demystification through data, not marketing claims

We have spoken to representatives of many of the emerging database projects. They have enthusiastically agreed to participate in an effort to disclose data about each project. Disclosure will start with two things: a common benchmark framework and benchmarks/load tests that model common use cases.

  1. A Common Benchmark Framework — No single aspect will do more to determine whether this collaboration succeeds or fails than arriving at a common benchmark framework.

At Basho we have observed the proliferation of “microbenchmarks,” or benchmarks that do not reflect the conditions of a production environment. Benchmarks that use a small data set, do not store to disk, or run for short durations (less than 12 hours) do more to confuse the issue for end users than any other single factor. Participants will agree on benchmark methods, tools, and applicability to use cases, and will make all benchmarks reproducible.

Compounding the confusion, benchmarks of different use cases, or benchmarks run on different hardware, are sometimes compared head-to-head as if the tests or systems were identical. We will seek to help participants run equivalent tests on the various databases, and we will not publish benchmark results that do not profile the hardware and configuration of the systems under test.

  2. Benchmarks That Support Use Cases — Participants agree to benchmark their software under the conditions and with load tests reflective of use cases they commonly see in their user base or for which they think their software is best suited.
  3. Dissemination to Third Parties — Providing easy-to-find data to any party interested in posting results.
  4. Honestly Exposing Disagreement — Where agreement cannot be reached on any of the elements of the common benchmarking efforts, participants will respectfully expose the rationales for their divergent positions, thus still providing users with information upon which to base decisions.

There is more work to be done but all participants should begin to see the benefits: faster, better decisions by users.

We invite others to join, once we are underway. We, and our counterparts at other projects, believe this approach will go a long way to removing the inefficiencies hindering adoption of our software.

Tony and Justin

Why Vector Clocks Are Hard

April 5, 2010

A couple of months ago, Bryan wrote about vector clocks on this blog. The title of the post was “Why Vector Clocks are Easy”; anyone who read the post would realize that he meant they’re easy for a client to use when talking to a system that implements them. Accordingly, there is no reason to fear or avoid using a service that exposes the existence of vector clocks in its API.

Of course, actually implementing such a system is not easy. Two of the hardest things are deciding what an actor is (i.e., where incrementing and resolution happen, and which parties get their own field in the vector) and how to keep vclocks from growing without bound over time.

In Bryan’s example the parties that actually proposed changes (“clients”) were the actors in the vector clocks. This is the model that vector clocks are designed for and work well with, but it has a drawback: the width of the vectors grows proportionally with the number of clients. In a group of friends deciding when to have dinner this isn’t a problem, but in a distributed storage system the number of clients over time can be large. Once the vector clocks get that large, they not only take up more space on disk and in RAM but also take longer to compare.

Let’s run through that same example again, but this time visualize the vector clocks throughout the scenario. If you don’t recall the whole story in the example, you should read Bryan’s post again as I am just going to show the data flow aspect of it here.

Vector Clocks by Example, in detail

Start with Alice’s initial message, where she suggests Wednesday.

date = Wednesday
vclock = Alice:1

Ben suggests Tuesday:

date = Tuesday
vclock = Alice:1, Ben:1

Dave replies, confirming Tuesday:

date = Tuesday
vclock = Alice:1, Ben:1, Dave:1

Now Cathy gets into the act, suggesting Thursday:

date = Thursday
vclock = Alice:1, Cathy:1

Dave has two conflicting objects:

date = Tuesday
vclock = Alice:1, Ben:1, Dave:1

and

date = Thursday
vclock = Alice:1, Cathy:1

Dave can tell that these versions are in conflict, because neither vclock “descends” from the other. Luckily, Dave’s a reasonable guy, and chooses Thursday. Dave also creates a vector clock that is a successor to all previously seen vector clocks, and emails this value back to Cathy.

date = Thursday
vclock = Alice:1, Ben:1, Cathy:1, Dave:2

So now when Alice asks Ben and Cathy for the latest decision, the replies she receives are, from Ben:

date = Tuesday
vclock = Alice:1, Ben:1, Dave:1

and from Cathy:

date = Thursday
vclock = Alice:1, Ben:1, Cathy:1, Dave:2

From this, she can tell that Dave intended his correspondence with Cathy to override the decision he made with Ben. All Alice has to do is show Ben the vector clock from Cathy’s message, and Ben will know that he has been overruled.

That worked out pretty well.
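What made that work is the “descends” comparison between vclocks. Here is a minimal Erlang sketch of the idea, treating a vclock as a list of {Actor, Counter} pairs; this is an illustration, not Riak’s internal vclock module:

```erlang
-module(vclock_sketch).
-export([increment/2, descends/2]).

%% A vclock here is a list of {Actor, Counter} pairs, e.g. [{alice,1},{ben,1}].

%% Bump the counter for Actor, adding the entry if it is absent.
increment(Actor, VClock) ->
    Count = proplists:get_value(Actor, VClock, 0),
    lists:keystore(Actor, 1, VClock, {Actor, Count + 1}).

%% A descends from B if, for every entry in B, A's counter is at least as large.
%% Two vclocks conflict when neither descends from the other.
descends(A, B) ->
    lists:all(fun({Actor, CountB}) ->
                      proplists:get_value(Actor, A, 0) >= CountB
              end, B).
```

Under that test, Dave’s merged clock descends from both of the conflicting clocks, while neither of those two descends from the other, which is exactly how the conflict was detected and then resolved.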

Making it Easier Makes it Harder

Notice that even in this short and simple example the vector clock grew from nothing to a four-entry mapping? In a real-world scenario with long-lived data, each data element would end up with a vector clock whose length is proportional to the number of clients that had ever modified it. That’s a (potentially unbounded) growth in storage volume and computation, so it’s a good idea to think about how to prevent it.

One straightforward idea is to make the servers handling client requests be the “actors”, instead of representing the clients directly. Since any given system usually has a known, bounded number of servers over time and also usually has fewer servers than clients, this serves to reduce and cap the size of the vclocks. I know of at least two real systems that have tried this. In addition to keeping growth under control, this approach attracts people because it means you don’t expose “hard” things like vector clocks to clients at all.

Let’s think through the same example, but with that difference, to see how it goes. We’ll assume that a 2-server distributed system is coordinating the communication, with clients evenly distributed among them. We’ll be easy on ourselves and allow for client affinity, so for the duration of the session each client will use only one server. Alice and Dave happen to get server X, and Ben and Cathy get server Y. To avoid getting too complicated here I am not going to draw the server communication; instead I’ll just abstract over it by changing the vector clocks accordingly.

We’re fine through the first few steps.

The only real difference so far is that each update increments a vector clock field named for the client’s chosen server instead of the client itself. This will mean that the number of fields needed won’t grow without bound; it will be the same as the number of servers. This is the desired effect of the change.

We run into trouble, though, when Cathy sends her update.

In the original example, this is where a conflict was created. Dave sorted out the conflict, and everything was fine. With our new strategy, though, something else happened. Ben and Cathy were both modifying from the same original object. Since we used their server id instead of their own name to identify the change, Cathy’s message has the same vector clock as Ben’s! This means that Dave’s message (responding to Ben) appears to be a simple successor to Cathy’s… and we lose her data silently!

Clearly, this approach won’t work. Remember the two systems I mentioned that tried this approach? Neither of them stuck with it once they discovered that it can be expected to silently lose updates.

For vector clocks to have their desired effect without causing accidents such as this, the elements represented by the fields in the vclock must be the real units of concurrency. In a case like this little example or a distributed storage system, that means client identifiers, not server-based ones.

Just Lose a Little Information and Everything Will Be Fine

If we use client identifiers, we’re back in the situation where vector clocks will grow and grow as more clients use a system over time. The solution most people end up with is to “prune” their vector clocks as they grow.

This is done by adding a timestamp to each field, and updating it to the current local time whenever that field is incremented. This timestamp is never used for vclock comparison — that is purely a matter of logical time — but is only for pruning purposes.

This way, when a given vclock gets too big, you can remove fields, starting at the one that was updated longest ago, until you hit a size/age threshold that makes sense for your application.
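For concreteness, a pruning pass might look something like this minimal Erlang sketch, assuming each entry carries the local timestamp of its last increment; this is illustrative, not Riak’s actual pruning code:

```erlang
%% Each entry is {Actor, Counter, Timestamp}, where Timestamp is the local
%% wall-clock time of that actor's last increment. Prune the
%% least-recently-updated entries until the clock is no larger than MaxSize.
prune(VClock, MaxSize) when length(VClock) =< MaxSize ->
    VClock;
prune(VClock, MaxSize) ->
    %% Sort newest-first and keep the MaxSize most recently updated entries.
    Sorted = lists:sort(fun({_, _, T1}, {_, _, T2}) -> T1 >= T2 end, VClock),
    {Keep, _Dropped} = lists:split(MaxSize, Sorted),
    Keep.
```

A real implementation would usually combine size and age thresholds, but the idea is the same: the timestamp only decides which field to drop, never how two vclocks compare.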

But, you ask, doesn’t this lose information?

Yes, it does — but it won’t make you lose your data. The only case where this kind of pruning will matter at all is when a client holds a very old copy of the unpruned vclock and submits data descended from that. This will create a sibling (conflict) even though you might have been able to resolve it automatically if you had the complete unpruned vclock at the server. That is the tradeoff with pruning: in exchange for keeping growth under control, you run the chance of occasionally having to do a “false merge”… but you never lose data quietly, which makes this approach unequivocally better than moving the field identifiers off of the real client and onto the server.

Review

So, vclocks are hard: even with perfect implementation you can’t have perfect information about causality in an open system without unbounded information growth. Realize this and design accordingly.

Of course, that is just advice for people building brand new distributed systems or trying to improve existing ones. Using a system that exposes vclocks is still easy.

-Justin

Schema Design in Riak – Relationships

March 25, 2010

In the previous installment we looked at how your reasons for picking Riak affect how your schema should be designed, and how you might go about structuring your data at the individual object level. In this post we’ll look at how to design relationships on top of Riak.

Relationships? I thought Riak was key-value.

Even a mildly complicated application is going to have more than one type of data to store and manipulate. Those data are not islands, but have relationships to one another that make your application and its domain more than just arbitrary lists of things.

Yes, at its core, Riak is a key-value store or distributed hash-table. Because key-value stores are not very sophisticated at modeling more complicated relationships, Riak adds the concept of links between objects that are qualified by “tags” and can be easily queried using “link-walking”.

Now, the knee-jerk reaction would be to start adding links to everything. I want to show you that the problem of modeling relationships is a little more nuanced than just linking everything together, and that there are many ways to express the same relationship — each having tradeoffs that you need to consider.

Key correspondence

The easiest way to establish a relationship is to have some correspondence between the keys of the items. This works well for one-to-one and some one-to-many relationships and is easy to understand.


In the simplest case, your related objects have the same key but different buckets. Lookups on this type of relationship are really efficient: you just change the bucket name to find the other item. How is this useful? Why not just store them together? One of the objects may get updated or read more often than the other. Their data types might be incompatible (a user profile and its avatar, for example). In either case, you get the benefit of the separation and fast access without needing link-walking or map-reduce; however, you really can only model one-to-one relationships with this pattern.

For one-to-many types of relationships, you might prefix or otherwise derive the key of the dependent (many) side of the relationship with the key of the parent side. This could be done as part of the bucket name, or as a simple prefix to the key. There are a couple of important tradeoffs to consider here. If you choose the bucket route, the number of buckets might proliferate in proportion to your data quantity. If you choose to prefix the key, it will be easy to find the parent object, but may be more difficult to find the dependent objects. The same reasons as having equivalent keys apply here — tight cohesion between the objects but different access patterns or internal structure.
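To make that concrete, here is a small Erlang sketch of deriving keys for both patterns; the bucket and key names are hypothetical:

```erlang
%% One-to-one: the same key in two different buckets.
profile_key(UserKey) -> {<<"profiles">>, UserKey}.
avatar_key(UserKey)  -> {<<"avatars">>,  UserKey}.

%% One-to-many: prefix the dependent object's key with the parent's key.
%% Finding the parent from a dependent key is easy (strip the prefix);
%% finding all dependents from the parent requires knowing their ids or
%% listing keys, which is the tradeoff mentioned above.
address_key(UserKey, AddressId) when is_binary(UserKey), is_binary(AddressId) ->
    {<<"addresses">>, <<UserKey/binary, "_", AddressId/binary>>}.
```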

De-normalization / Composition

A core principle in relational schema design is factoring your relations so that they achieve certain “normal forms”, especially in one-to-many sorts of relationships. This means that if your domain concept “has” any number of something else, you’ll make a separate table for that thing and insert a foreign key that points back to the owner. In Riak, the opposite often holds: de-normalizing (or composing) your data frequently makes sense, both for the sake of performance and for ease of modeling.

How does this work? Let’s say your relational database had tables for people and for addresses. A person may have any number of addresses for home, work, mailing, etc, which are related back to the person by way of foreign key. In Riak, you would give your person objects an “addresses” attribute, in which you would store a list or hash of their addresses. Because the addresses are completely dependent on the person, they can be a part of the person object. If addresses are frequently accessed at the same time as the person, this also results in fewer requests to the database.
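For example, a composed person document stored as JSON might look like the following (the field names here are hypothetical):

{
  "name": "Jane Doe",
  "email": "jane@example.com",
  "addresses": [
    {"type": "home",    "street": "123 Main St", "city": "Gainesville"},
    {"type": "mailing", "street": "PO Box 456",  "city": "Gainesville"}
  ]
}

A single GET then returns the person together with all of their addresses, and a single PUT updates them together.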

Composition of related data is not always the best answer, even when a clear dependency exists; take for instance, the Twitter model. Active users can quickly accrue thousands of tweets, which need to be aggregated in different combinations across followers’ timelines. Although the tweet concept is dependent on the user, it has more conceptual weight than the user does and needs to stand by itself. Furthermore, performance would suffer if you had to pull all of a user’s tweets every time you wanted to see their profile data.

Good candidates for composition are domain concepts that are very dependent on their “owner” concept and are limited in number. Again, knowing the shape of your data and the access pattern are essential to making this decision.

Links

Links are by far the most flexible (and popular) means for modeling relationships in Riak, and it’s easy to see why. They hold the promise of giving a loose graph-like shape to your relatively flat data and can cleanly represent any cardinality of relationship. Furthermore, link-walking is a really attractive way to quickly do queries that don’t need the full power of map-reduce (although Riak uses map-reduce behind the scenes to traverse the links). To establish a relationship, you simply add a link on one object pointing to the other.

Intrinsically, links have no notion of cardinality; establishing that is entirely up to your application. The primary difference between cardinalities is whether changing an association replaces the links on the associated objects or adds and removes them. Your application will also have to do some accounting about which objects are related to other objects, and establish links accordingly. Since links are uni-directional, stored on the source, and incoming links are not automatically detected, your application will need to add the reciprocal links when traversals in both directions are needed (resulting in multiple PUT operations). In some cases, especially in one-to-many relationships where the “many” side is not accessed independently, you might not need to establish the reciprocal link. Knowing how your data will be accessed by the application — both reads and writes — will help you decide.
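To illustrate, here is a rough Erlang sketch of adding a link to an object before storing it, assuming links live in the object's metadata under the <<"Links">> key as {{Bucket, Key}, Tag} tuples, which is how Riak represented them at the time; the helper name is made up, so treat the details as illustrative:

```erlang
%% Add a uni-directional link from Obj to {Bucket, Key}, qualified by Tag.
%% A reciprocal link requires a separate update (and PUT) on the target object.
add_link(Obj, Bucket, Key, Tag) ->
    MD = riak_object:get_update_metadata(Obj),
    Links = case dict:find(<<"Links">>, MD) of
                {ok, L} -> L;
                error   -> []
            end,
    riak_object:update_metadata(
        Obj, dict:store(<<"Links">>, [{{Bucket, Key}, Tag} | Links], MD)).
```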

Links have a few other limitations that you will need to consider. First, although the tag part of the link can technically be any Erlang term, using anything other than a binary string may make it difficult for HTTP-based clients to deal with them. Second, since links are stored directly with the object in its metadata, objects that have many links will be slower to load, store, and perform map-reduce queries over. In the HTTP/REST interface as well, there are practical limitations simply because of the method of transport. At the time of writing, mochiweb — the library that is the foundation of webmachine, Riak’s HTTP interface — uses an 8K buffer for incoming requests and limits the request to 1000 header fields (including repeated headers). This means that each Link: header you provide needs to be less than 8K in length, and assuming you use the typical headers when storing, you can have at most about 995 individual Link: headers. By the time you reach the approximately 150,000 links that those limits allow, you’ll probably want to consider other options anyway.

Hybrid solutions

At this point, you might be wondering how your data is going to fit any of these individual models. Luckily, Riak is flexible, so you can combine them to achieve a schema that best fits your needs. Here are a few possibilities.

Often, either the number of links on an object grows large or the need to update them independently of the source object arises. In our Twitter example, updating who you follow is a significantly different operation from updating your user profile, so it makes sense to store those separately, even though they are technically a relationship between two users. You might have the user profile object and list of followed users as key-correspondent objects, such as users/seancribbs and following/seancribbs (not taking into account your followers, of course).

In relational databases you typically use the concept of a “join table” to establish many-to-many relationships. The intermediary table holds foreign keys back to the associated objects, and each row represents one individual association, essentially an “adjacency list”. As your domain becomes more complex and nuanced, you might find that these relationships represented by join tables become domain concepts in their own right, with their own attributes. In Riak, you might initially establish many-to-many relationships as links on both sides. Similarly to the “join table” issue, the relationship in the middle might deserve an object of its own. Some examples that might warrant this design: qualified relationships (think “friends” on Facebook, or permissions in an ACL scheme), soft deletion, and history (tracking changes).

Key correspondence, composition and linking aren’t exclusive ways to think of relationships between data in your application, but tools to establish the semantics your domain requires. I’ve said it many times already, but carefully evaluate the shape of your data, the semantics you want to impose on it, and the operational profile of your application when choosing how you structure your data in Riak.

Sean Cribbs

The Craft Brewers of NoSQL

Just a few days ago, we did something a bit new at Basho. We posted the beginning of a public discussion to explore and document some differences between various NoSQL systems. Some people have attempted such comparisons before, but generally from an external observer’s point of view. When something like this comes from a producer of one of the systems in question it necessarily changes the tone.

If you weren’t really paying attention you could choose to see this as aggressive competition on our part, but the people that have chosen to engage with us have hopefully seen that it’s the opposite: an overt attempt at collaboration. While the initial comparisons were definitely not complete (for instance, in some cases they reflected the current self-documented state of systems instead of the current state of their code) they nonetheless very much had the desired effect.

That effect was to create productive conversation, resulting both in improvement of the comparison documents and in richer ongoing communication between the various projects out there. Our comparisons have already improved a great deal as a result of this and will continue to do so. I attribute this to the constructive attention that they have received from people deeply in the trenches with the various systems being discussed. That attention has also, I hope, given us a concrete context in which to strengthen our relationships with those people and projects.

Some of the attention we received was from people that are generally unhelpful; there are plenty of trolls on the internet who are more interested in throwing stones than in useful conversation. There’s not much point in wading into that kind of a mess as everyone gets dirty and nothing improves as a result. Luckily, we also received attention from people who actually build things. Those sorts of people tend to be much more interested in productive collaboration, and that was certainly the case this time. Though they’re by no means the only ones we’ve been happy to talk to, we can explicitly thank Greg Arnette, Jonathan Ellis, Benjamin Black, Mike Dirolf, Joe Stump, and Peter Neubauer for being among those spending their valuable time talking with us recently.

It’s easy to claim that any attempt to describe others that isn’t immediately perfect is just FUD, but our goal here is to help remove the fear, uncertainty, and doubt that people outside this fledgling community already have about all of this weird NoSQL stuff. By engaging each other in direct, honest, open, civil conversations we can all improve our knowledge as well as the words we use to describe each others’ work.

The people behind the various NoSQL systems today have a lot in common with the American craft brewers of the 1980s and 1990s. (Yes, I’m talking about beer.) You might casually imagine that people trying to sell similar products to the same market would be cutthroat competitors, but you’d be wrong. When you are disrupting a much larger established industry, aggression amongst peers isn’t a good route to success.

The friend who convinced me to try a Sam Adams in 1993 wasn’t striking a blow against Sierra Nevada or any of the other craft brewers at the time. In fact, he was helping all of those supposed “competitors” by opening up one more pair of eyes to the richness of choices that are available to all. People who enjoy good beer will happily talk about the differences between their brews all day, but in the end what matters is that when they walk into a bar they will look to see what choices they have at the tap instead of just ordering the same old Bud without a second thought.

Understanding that “beer” doesn’t always mean exactly the same beverage is the key realization; likewise, the most important thing people outside the NoSQL community can realize is that not all data problems are shaped like a typical RDBMS.

Of course, any brewer will talk about their own product more than anything else, but will also know that good conversations lead to improvements by all and to the potential greater success of the entire community they exist in. Sometimes the way to start a good conversation is to talk about what you know best, with people that you know will have a different point of view than your own. From there, everyone’s knowledge, perspective, and understanding can improve.

At Basho we’re not just going to keep doing what we’ve already done in terms of communication. We’re going to keep finding new and better ways of communicating, and do it more often.

In addition to continuing to work with others on finding the right ways to publicly discuss our differences and similarities on fundamentals, we will also do so in specific areas such as performance testing, reliability under duress, and more. This will remain tricky, because it is easy for people to get confused by superficial issues and distracted from the more interesting ones — and opinions will vary on which are which. In discussing those opinions, we will all become more capable practitioners and advocates of the craft that binds us together.

Ruffling a few feathers is a small price to pay if better conversations occur. This is especially true if the people truly creating value by building systems learn how to work better together in the process.

Justin

Riak in Production – A Distributed Event Registration System Written in Erlang

March 20, 2010

Riak, at its core, is an open source project. So, we love the opportunity to hear from our users and find out where and how they are using Riak in their applications. It is for that reason that we were excited to hear from Chris Villalobos. He recently put a Distributed Event Registration application into production at his church in Gainesville, Florida, and after hearing a bit about it, we asked him to write a short piece about it for the Basho Blog.

Use Case and Prototyping

As a way of going paperless at our church, I was tasked with creating an event registration system that was accessible via touchscreen kiosk, SMS, and our website, to be used by members to sign up for various events. As I wanted to learn a new language and had dabbled in Erlang (specifically Mochiweb) for another small application, I decided that I was going to try to do the whole thing in Erlang. But how to do it, and on a two-month timeline, was quite the challenge.

The initial idea was to have each kiosk independently hold pieces of the database, so that in the event something happened to a server or a kiosk, the data would still be available. Also, I wanted to use the fault-tolerance and distributed processing of Erlang to help make sure that the various frontends would be constantly running and online. And, as I wanted to stay as close to pure Erlang as possible, I decided early against a SQL database. I tried Mnesia but I wasn’t happy with the results; when I used QLC as an interface, interesting issues arose when I took down a master node. (I was also facing a time issue, so playing with it extensively wasn’t really an option.)

It just so happened that Basho released Riak 0.8 the morning I got fed up with it. So I thought about how I could use a key/value store. I liked how the Riak API made it simple to get data in and out of the database, how I could use map-reduce functionality to create any reports I needed and how the distribution of data worked out. Most importantly, no matter what nodes I knocked out while the cluster was running, everything just continued to click. I found my datastore.

During the initial prototyping stages for the kiosk, I envisioned a simple key/value store using a data model that looked something like this:

```erlang
[
 {key1, {Title, Icon, BackgroundImage, Description, [signup_options]}},
 {key2, {...}}
]
```

This design would enable me to present the user with a list of options when the kiosk was started up. I found that by using Riak, this was simple to implement. I also enjoyed that Riak was great at getting out of the way: I didn’t have to think about how it was going to work, I just knew that it would. (The primary issue I kept running into when I thought about future problems was sibling entries. If two users on two kiosks submit information at the same time for the same entry (potentially an issue as the number of kiosks grows), that would result in sibling entries because of the way user data is stored:

```erlang
<>, <>, [user data]
```

But, by checking for siblings when the reports are generated, this problem became a non-issue.)

High Level Architecture

The kiosk is live and running now with very few kinks (mostly hardware) and everything is in pure Erlang. At a high level, the application architecture looks like this:

Each Touchscreen Kiosk:

  • wxErlang
  • Riak node

Web-Based Management/SMS Processing Layer:

  • Nitrogen Framework speaking to Riak for Kiosk Configuration/Reporting
  • Nitrogen/Mochiweb processing SMS messages from SMS aggregator

Periodic Email Sender:

  • Vagabond’s gen_smtp client running an eternal receive-after-24-hours, send-email loop.

In Production

Currently, we are running four Riak nodes (writing out to the filesystem backend) outside of the three kiosks themselves. I also have various Riak nodes on my random Linux servers, because I can use the CPU cycles on those nodes to distribute MapReduce functions and store information in a redundant fashion.

By using Riak, I was able to keep the database lean and mean with creative uses of keys. Every asset for the kiosk is stored within Riak, including images. These are pulled only whenever a kiosk is started up or whenever an asset is created, updated, or removed (using message passing). If an image isn’t present on a local kiosk, it is pulled from the database and then stored locally. Also, all images and panels (such as the on-screen keyboard) are stored in memory to make things faster.

All SMS messages are stored within an SMS bucket. Every 24 hours all the buckets are checked with a “mapred_bucket” query to see if there are any new messages since the last time the function ran. These results are formatted within the MapReduce function and emailed out using the gen_smtp client. As assets are removed from the system, the current data is stored within a serialized text file and then removed from the database.

As I bring more kiosks into operation, the distributed map-reduce feature is becoming more valuable. Since I typically run reports during off hours, the kiosks aren’t overloaded by the extra processing. So far I have been able to roll out a new kiosk within 2 hours of receiving the hardware. Most of this time is spent doing the installation and configuration of the touchscreen. Also, the system is becoming more and more vital to how we are interfacing with people, giving members multiple ways of contacting us at their convenience. I am planning on expanding how I use the system, especially with code distribution. For example, with the Innostore interface, I might store the beam files inside Riak and send them to the kiosks using Erlang commands. (Version Control inside Riak, anyone?)

What’s Next?

I have ambitious plans for the system, especially on the kiosk side. As this is a very beta version of the software, it is only currently in production in our little community. That said, I hope to open source it and put it on github/bitbucket/etc. as soon as I pretty up all the interfaces.

I’d say probably the best thing about this whole project is getting to know the people inside the Erlang community, especially the Basho people and the #erlang regulars on IRC. Anytime I had a problem, someone was there willing to work through it with me. Since I am essentially new to Erlang, it really helped to have a strong sense of community. Thank you to all the folks at Basho for giving me a platform to show what Erlang can do in everyday, out of the way places.

Chris Villalobos

Schema Design in Riak – Introduction

March 19, 2010

One of the challenges of switching from a relational database (Oracle, MySQL, etc.) to a “NoSQL” database like Riak is understanding how to represent your data within the database. This post is the beginning of a series of entries on how to structure your data within Riak in useful ways.

Choices have consequences

There are many reasons why you might choose Riak for your database, and I’m going to explain how a few of those reasons will affect the way your data is structured and manipulated.

One oft-cited reason for choosing Riak, and other alternative databases, is the need to manage huge amounts of data, collectively called “Big Data”. If you’re storing lots of data, you’re less likely to be doing online queries across large swaths of the data. You might be doing real-time aggregation in addition to calculating longer-term information in the background or offline. You might have one system collecting the data and another processing it. You might be storing loosely-structured information like log data or ad impressions. All of these use-cases call for low ceremony, high availability for writes, and little need for robust ways of finding data — perfect for a key/value-style scheme.

Another reason one might pick Riak is for flexibility in modeling your data. Riak will store any data you tell it to in a content-agnostic way — it does not enforce tables, columns, or referential integrity. This means you can store binary files right alongside more programmer-transparent formats like JSON or XML. Using Riak as a sort of “document database” (semi-structured, mostly de-normalized data) and “attachment storage” will have different needs than the key/value-style scheme — namely, the need for efficient online queries, conflict resolution, increased internal semantics, and robust expressions of relationships.

The third reason for choosing Riak I want to discuss is related to CAP: Riak prefers A (Availability) over C (Consistency). In contrast to a traditional relational database system, in which transactional semantics ensure that a datum will always be in a consistent state, Riak chooses to accept writes even if the state of the object has been changed by another client (in the case of a race condition), or if the cluster was partitioned and the state of the object diverges. These architectural choices bring to the fore something we should have been considering all along — how should our applications deal with inconsistency? Riak lets you choose whether to let the “last one win” or to resolve the conflict in your application by automated or human-assisted means.

More mindful domain modeling

What’s the moral of these three stories? When modeling your data in Riak, you need a better understanding of the shape of your data. You can no longer rely on normalization, foreign key constraints, secondary indexes, and transactions to make decisions for you.

Questions you might ask yourself when designing your schema:

  • Will my access pattern be read-heavy, write-heavy, or balanced?
  • Which datasets churn the most? Which ones require more sophisticated conflict resolution?
  • How will I find this particular type of data? Which method is most efficient?
  • How independent/interrelated is this type of data with this other type of data? Do they belong together?
  • What is an appropriate key-scheme for this data? Should I choose my own or let Riak choose?
  • How much will I need to do online queries on this data? How quickly do I need them to return results?
  • What internal structure, if any, best suits this data?
  • Does the structure of this data promote future design modifications?
  • How resilient will the structure of the data be if requirements change? How can the change be effected without serious interruption of service?

I like to draw up my domain concepts on a pad of unlined paper or a whiteboard with boxes and arrows, then figure out how they map onto the database. Ultimately, the concepts define your application, so get those solid before you even worry about Riak.

Thinking non-relationally

Once you’ve thought carefully about the questions described above, it’s time to think about how your data will map to Riak. We’ll start from the small scale in this post (single domain concepts) and work our way out in future installments.

Internal structure

For a single class of objects in your domain, let’s consider the structure of that data. Here’s where you’re going to decide two interrelated issues — how this class of data will be queried and how opaque its internal structure will be to Riak.

The first issue, how the data will be queried, depends partly on how easy it is to intuit the key of a desired object. For example, if your data is user profiles that are mostly private, perhaps the user’s email or login name would be appropriate for the key, which would be easy to establish when the user logs in. However, if the key is not so easy to determine, or is arbitrary, you will need map-reduce or link-walking to find it.

The second issue, how opaque the data is to Riak, is affected by how you query but also by the nature of the data you’re storing. If you need to do intricate map-reduce queries to find or manipulate the data, you’ll likely want it in a form like JSON (or an Erlang term) so your map and reduce functions can reason about the data. On the other hand, if your data is something like an image or PDF, you don’t want to shoehorn that into JSON. If you’re in the situation where you need both a form that’s opaque to Riak, and to be able to reason about it with map-reduce, have your application add relevant metadata to the object. These are created using X-Riak-Meta-* headers in HTTP or riak_object:update_metadata/2 in Erlang.
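As a sketch of the Erlang side of that, something like the following could attach custom metadata to an otherwise opaque object; the bucket, key, and metadata names are made up, and the exact metadata layout may differ between Riak versions:

```erlang
%% Store a PDF as an opaque binary, but attach custom metadata so that
%% map/reduce functions can reason about the object without parsing its body.
store_report(PdfBinary, Author) ->
    {ok, Client} = riak:local_client(),
    Obj0 = riak_object:new(<<"reports">>, <<"q1-summary">>, PdfBinary),
    MD0  = riak_object:get_update_metadata(Obj0),
    MD1  = dict:store(<<"content-type">>, "application/pdf", MD0),
    MD2  = dict:store(<<"X-Riak-Meta">>,
                      [{"X-Riak-Meta-Author", Author}], MD1),
    Client:put(riak_object:update_metadata(Obj0, MD2), 2).
```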

Rule of thumb: if it’s an abstract datatype, use a map-reduce-friendly format like JSON; if it’s a concrete form, use its original representation. Of course, there are exceptions to every rule, so think carefully about your modeling problem.

Consistency, replication, conflict resolution

The second issue I would consider for each type of data is the access pattern and desired level of consistency. This is related to the questions above of read/write loads, churn, and conflicts.

Riak provides a few knobs you can turn at schema-design time and at request-time that relate to these issues. The first is allow_mult, or whether to allow recording of divergent versions of objects. In a write-heavy load or where clients are updating the same objects frequently, possibly at the same time, you probably want this on (true), which you can change by setting the bucket properties. The tradeoffs are that the vector clock may grow quickly and your application will need to decide how to resolve conflicts.
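Flipping that switch from the Riak console looks roughly like this; a sketch against the Erlang client API of the time, with a hypothetical bucket name:

```erlang
%% Record siblings for a bucket whose objects see concurrent updates.
{ok, Client} = riak:local_client().
Client:set_bucket(<<"user_timelines">>, [{allow_mult, true}]).
```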

The second knob you can turn is the n_val, or how many replicas of each object to store, also a per-bucket setting. The default value is 3, which will work for many applications. If you need more assurance that your data is going to withstand failures, you might increase the value. If your data is non-critical or in large chunks, you might decrease the value to get greater performance. Knowing what to choose for this value will depend on an honest assessment of both the value of your data and operational concerns.

The third knob you can turn is per-request quorums. For reads, this is the R request parameter: how many replicas need to agree on the value for the read to succeed (the default is 2). For writes, there are two parameters: W is how many replicas need to acknowledge the write request before it succeeds (default is 2), and DW (durable writes) is how many replica backends need to confirm that the write finished before the entire write succeeds (default is 0). If you need greater consistency when reading or writing your data, you’ll want to increase these numbers. If you need greater performance and can sacrifice some consistency, decrease them. In any case, your R, W, and DW values cannot exceed n_val if you want the request to succeed.
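From the Erlang client of the time, those quorums were passed per request, roughly like this; a sketch in which the buckets, keys, and values are made up:

```erlang
{ok, Client} = riak:local_client().

%% Read with R = 3: all three replicas must agree before the get succeeds.
{ok, Obj} = Client:get(<<"accounts">>, <<"alice">>, 3).

%% Write with W = 3 and DW = 2: three replicas must acknowledge the write,
%% and two must confirm it reached durable storage, before put returns ok.
ok = Client:put(riak_object:new(<<"logs">>, <<"entry-1">>, <<"data">>), 3, 2).
```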

What do these have to do with your data model? Fundamentally understanding the structure and purpose of your data will help you determine how you should turn these knobs. Some examples:

  • Log data: You’ll probably want low R and W values so that writes are accepted quickly. Because these are fire-and-forget writes, you won’t need allow_mult turned on. You might also want a low n_val, depending on how critical your data is.
  • Binary files: Your n_val is probably the most significant issue here, mostly depending on how large your files are and how many replicas of them you can tolerate (storage consumption).
  • JSON documents (abstract types): The defaults will work in most cases. Depending on how frequently the data is updated, and how many you update within a single conceptual operation with the application, you may want to enable allow_mult to prevent blind overwrites.

Sean Cribbs

 

The Release of the Riak Wiki and the Fourth Basho Podcast

March 12, 2010

We are moving at warp speed here at Basho and today we are releasing what we feel is a very important enhancement to Riak: a wiki.

You can find it here:

http://docs.basho.com

Documentation and resources are a main priority right now for Basho, and a well maintained and up-to-date wiki is something we see as critical. Our goal is to make Riak simple and intuitive to download, build, program against, and build apps on. So, you should expect a lot more from us in this regard. Also, we still have much to add to the Riak Wiki, so if you think we are missing a resource or some documentation that makes Riak easier to use and learn about, please tell us.

Secondly, we had the chance to record the fourth installment of the Basho Riak podcast (below), and it was a good one. We hooked up with Tim Anglade, CTO of GemKitty and a growing authority in the NoSQL space. On the heels of his presentation at NoSQL Live from Boston, we picked his brain a bit about where he thinks the industry is going and what needs to change for the current iteration of NoSQL to go from being a fad and a curiosity to a full-fledged industry.

According to Tim, “We have an image problem right now with NoSQL as a brand,” and “NoSQL is over-hyped and the projects behind it are under-hyped.”

We also took a few minutes to talk about the Riak 0.9.1 release. Highlights include binary builds, as well as several new client libraries that expose all of Riak’s advanced features.

In short, if you are at all interested in the future of the NoSQL space, you’re not going to want to miss this.

Lastly, if you haven’t already done so, go download the latest version of Riak.

Enjoy!

Mark

Link Walking By Example

February 24, 2010

Riak has a notion of “links” as part of the metadata of its objects. We talk about traversing, or “walking”, links, but what do the queries for doing so actually look like?

Let’s put four objects in Riak:

  1. hb/first will link to hb/second and hb/third
  2. hb/second will link to hb/fourth
  3. hb/third will also link to hb/fourth
  4. hb/fourth doesn’t link anywhere

$ curl -X PUT -H "content-type: text/plain" \
  -H 'Link: </riak/hb/second>; riaktag="foo", </riak/hb/third>; riaktag="bar"' \
  http://localhost:8098/riak/hb/first --data "hello"

$ curl -X PUT -H "content-type: text/plain" \
  -H 'Link: </riak/hb/fourth>; riaktag="foo"' \
  http://localhost:8098/riak/hb/second --data "the second"

$ curl -X PUT -H "content-type: text/plain" \
  -H 'Link: </riak/hb/fourth>; riaktag="foo"' \
  http://localhost:8098/riak/hb/third --data "the third"

$ curl -X PUT -H "content-type: text/plain" \
  http://localhost:8098/riak/hb/fourth --data "the fourth"

Now, say we wanted to start at hb/first, and follow all of its outbound links. The easiest way to do this is with the link-walker URL syntax:

$ curl http://localhost:8098/riak/hb/first/_,_,_

The response will be a multipart/mixed body with two parts: the hb/second object in one, and the hb/third object in the other:

--N2gzGP3AY8wpwdQY0jio62L9nJm
Content-Type: multipart/mixed; boundary=3ai6VRl4aLli3dKw8tG9unUeznT

--3ai6VRl4aLli3dKw8tG9unUeznT
X-Riak-Vclock: a85hYGBgzGDKBVIsTKLLozOYEhnzWBn+H/h5hC8LAA==
Location: /riak/hb/third
Content-Type: text/plain
Link: </riak/hb>; rel="up", </riak/hb/fourth>; riaktag="foo"
Etag: 5Fs0VskZWx7Y25tf1oQsvS
Last-Modified: Wed, 24 Feb 2010 15:25:51 GMT

the third
--3ai6VRl4aLli3dKw8tG9unUeznT
X-Riak-Vclock: a85hYGBgzGDKBVIsLEHbN2YwJTLmsTLMPvDzCF8WAA==
Location: /riak/hb/second
Content-Type: text/plain
Link: </riak/hb>; rel="up", </riak/hb/fourth>; riaktag="foo"
Etag: 2ZKEJ2gaT57NT7xhLDPCQz
Last-Modified: Wed, 24 Feb 2010 15:24:11 GMT

the second
--3ai6VRl4aLli3dKw8tG9unUeznT--

--N2gzGP3AY8wpwdQY0jio62L9nJm--

It’s also possible to express the same query in map-reduce, directly:

$ curl -X POST -H "content-type: application/json" \
  http://localhost:8098/mapred --data @-
{"inputs":[["hb","first"]],"query":[{"link":{}},{"map":{"language":"javascript","source":"function(v) { return [v]; }"}}]}
^D

That’s the exact same query, but the content type of the response is different. It’s now a JSON array with two elements: a JSON encoding of the hb/second object, and a JSON encoding of the hb/third object. (Pretty-printed here for clarity.)

[
    {
        "bucket": "hb",
        "key": "second",
        "vclock": "a85hYGBgzGDKBVIsLEHbN2YwJTLmsTLMPvDzCF8WAA==",
        "values": [
            {
                "metadata": {
                    "Links": [
                        ["hb","fourth","foo"]
                    ],
                    "X-Riak-VTag": "2ZKEJ2gaT57NT7xhLDPCQz",
                    "content-type": "text/plain",
                    "X-Riak-Last-Modified": "Wed, 24 Feb 2010 15:24:11 GMT",
                    "X-Riak-Meta": []
                },
                "data": "the second"
            }
        ]
    },
    {
        "bucket": "hb",
        "key": "third",
        "vclock": "a85hYGBgzGDKBVIsTKLLozOYEhnzWBn+H/h5hC8LAA==",
        "values": [
            {
                "metadata": {
                    "Links": [
                        ["hb","fourth","foo"]
                    ],
                    "X-Riak-VTag": "5Fs0VskZWx7Y25tf1oQsvS",
                    "content-type": "text/plain",
                    "X-Riak-Last-Modified": "Wed, 24 Feb 2010 15:25:51 GMT",
                    "X-Riak-Meta": []
                },
                "data": "the third"
            }
        ]
    }
]

Another interesting query is “follow only links that are tagged foo.” For that, just add a tag field to the link phase spec:

$ curl -X POST -H "content-type: application/json" \
  http://localhost:8098/mapred --data @-
{"inputs":[["hb","first"]],"query":[{"link":{"tag":"foo"}},{"map":{"language":"javascript","source":"function(v) { return [v]; }"}}]}
^D

Here you should get a JSON array with one element: a JSON encoding of the hb/second object. The link to the hb/third object was tagged bar, so that link was not followed. The equivalent URL syntax is:

$ curl http://localhost:8098/riak/hb/first/_,foo,_

It’s also possible to filter links by bucket by adding a bucket field to the link phase spec, or by replacing the first underscore with a bucket name in the URL format. But, all of our example links point to the same bucket, so hb is the only interesting setting here.

Link phases may also be chained together (or put after other phases if those phases produce bucket/key lists). For example, we could follow the links all the way from hb/first to hb/fourth with:

$ curl -X POST -H "content-type: application/json" \
  http://localhost:8098/mapred --data @-
{"inputs":[["hb","first"]],"query":[{"link":{}},{"link":{}},{"map":{"language":"javascript","source":"function(v) { return [v]; }"}}]}
^D

(Notice the added link phase.) If you run that, you’ll find that you get two copies of the hb/fourth object in the response. This is because we didn’t bother uniquifying the results of the link extraction, and both hb/second and hb/third link to hb/fourth. A reduce phase is fairly easy to add:

$ curl -X POST -H "content-type: application/json" \
  http://localhost:8098/mapred --data @-
{"inputs":[["hb","first"]],"query":[{"link":{}},{"link":{}},{"reduce":{"language":"erlang","module":"riak_mapreduce","function":"reduce_set_union"}},{"map":{"language":"javascript","source":"function(v) { return [v]; }"}}]}
^D

The resource handling the URL link-walking format does just this:

$ curl http://localhost:8098/riak/hb/first/_,_,_/_,_,_

That should get you just one copy of the hb/fourth object.

So why choose either map/reduce or URL-syntax? The advantage of URL syntax is that if you’re starting from just one object, and just want to get the objects at the ends of the links, and you can handle multipart/mixed encoding, then URL syntax is much simpler and more compact. Map/reduce with link phases should be your choice if you want to start from multiple objects at once, or you want to get some processed or aggregated form of the objects, or you want the result to be JSON-encoded.

Riak version 0.8 note: In Riak 0.8, the result format of “link” map/reduce phases could not be transformed into JSON. This meant both that it was not possible to put a Javascript reduce phase right after a link phase, and that it was not possible to end an HTTP map/reduce query with a link phase. Those issues have been resolved in the tip of the source repository and will be part of the 0.9 release.

-Bryan

Using Innostore with Riak

February 22, 2010

Innostore is an Erlang application that provides an API for storing and retrieving key/value data using the InnoDB storage system. This storage system is the same one used by MySQL for reliable, transactional data storage. It’s a proven, fast system and perfect for use with Riak if you have a large amount of data to store. Let’s take a look at how you can use Innostore as a backend for Riak.

(Note: I assume that you have successfully built an instance of Riak for your platform. If you built Riak from source in ~/riak, then set $RIAK to ~/riak/rel/riak.)

We first get started by grabbing a stable release of Innostore. You’ll need to download the source for a release from: https://github.com/basho/innostore

Looking in the “Tags & snapshots” section, you should download the source for the highest available RELEASE_* tag. In my case, RELEASE_4 is the most recent release, so I’ll grab the bz2 file associated with it.

Once I have the source code, it’s time to unpack it and build:

$ tar -xjf innostore-RELEASE_4.tar.bz2

$ cd innostore

$ make

Depending on the speed of the machine you are building on, this may take a few minutes to complete. At the end, you should see a series of unit tests run, with the output ending:
=======================================================

All 7 tests passed.

100222 7:43:58 InnoDB: Shutdown completed; log sequence number 90283

Cover analysis: /Users/dizzyd/src/public/innostore/.eunit/index.html

Now that we have successfully built Innostore, it’s time to install it into the Riak distribution:

$ ./rebar install target=$RIAK/lib

If you look in the $RIAK/lib directory now, you should see the innostore-4 directory alongside a bunch of .ez files and other directories which compose the Riak release.

Now, we need to tell Riak to use the Innostore driver as a backend. Make sure Riak is not running. Edit $RIAK/etc/app.config, setting the value for “storage_backend” as follows:

{storage_backend, innostore_riak},

In addition, append the configuration for the Innostore application after the SASL section:

{sasl, [ ....
]},   %% <-- make sure you add a comma here!!

{innostore, [
    {data_home_dir, "data/innodb"},        %% Where data files go
    {log_group_home_dir, "data/innodb"},   %% Where log files go
    {buffer_pool_size, 2147483648}         %% 2G in-memory buffer in bytes
]}

You may need to adjust data_home_dir and log_group_home_dir to match where you want the InnoDB data and log files to be stored. If possible, make sure that the data and log dirs are on separate disks — this can yield much better performance.

Once you’ve completed the changes to $RIAK/etc/app.config, you’re ready to start Riak:

$ $RIAK/bin/riak console

As it starts up, you should see messages from Inno that end with something like:

100220 16:36:58 InnoDB: highest supported file format is Barracuda.

100220 16:36:58 Embedded InnoDB 1.0.3.5325 started; log sequence number 45764

That’s it! You’re ready to start using Riak for storing truly massive amounts of data.

Enjoy,

Dave Smith

Calling all Rubyists – Ripple has Arrived!

February 11, 2010

The Basho Dev. Team has been very excited about working with the Ruby community for some time. The only problem was that we were heads-down on so many other projects that it was hard to make any progress. But, even with all that work on our plate, we were committed to showing some love to Rubyists and their frameworks.

Enter Sean Cribbs. As Sean details in his latest blog post, Basho and the stellar team at Sonian made it possible for him to hack on Ripple, a one-of-a-kind client library and object mapper for Riak. The full feature set for Ripple can be found on Sean’s blog, but highlights include a DataMapper-like API, an easy builder-style interface to Map/Reduce, and near-term integration with Rails 3.

And, in case you need any convincing that you should consider Riak as the primary datastore for your next Rails app, check out Sean’s earlier post, “Why Riak should power your next Rails app.”

So, if you’ve read enough and want to get your hands on Ripple, go check it out on GitHub.

If you don’t have Riak downloaded and built yet, get on it.

Lastly, you are going to be seeing a lot more Riak in your Ruby. So stay tuned because we have some big plans.

Best,

Mark