Tag Archives: key/value

Why MapReduce Is Hard

April 20, 2011

At the highest level, MapReduce can be a fundamentally hard concept to grasp. In our last blog post on MapReduce, you learned how easy it is. After all, it’s just code, right?

The story gets a bit more complicated as soon as you run MapReduce in a distributed environment like Riak. Some of these caveats are just things to be aware of; others stem from the nature of both Riak’s implementation and the ideas behind MapReduce itself. On top of that, it’s just a hard concept to grasp. I’ll be the first to admit it took me a long time to understand.

MapReduce is hard because it’s just code

Wait, didn’t I just tell you that MapReduce is easy because it’s just code? I believe I did, Bob.

Code requires knowledge of what it does. Code requires the discipline to keep it clean and easy to grasp. Code requires knowledge of the data structures involved. Code requires knowledge of how Riak treats your MapReduce functions depending on their phase in a particular MapReduce request.

Take Riak’s map functions. Depending on whether you intend to chain multiple map phases in one request, they must return either a list of bucket/key pairs (to feed the next map phase) or the data itself, destined for the client or a following reduce function.

One way to solve this would be to add an extra argument, based on which the map function can decide whether it returns a bucket/key pair or the data. Why would you do this? Consider a scenario where you want to extract a set of objects based on a match in one attribute and then run a second map matching another attribute. Here’s an example:

```javascript
var extractByMatchingAttribute = function(value, keydata, args) {
  var doc = Riak.mapKeyValuesJson(value)[0];
  if (doc.host.match(/basho.com/)) {
    if (args.last) {
      return [1];
    } else {
      return [[value.bucket, value.key]];
    }
  }
  return []; // no match: contribute nothing to the result
};
```

Now you can pass in an argument object (e.g. {last: true}) to every phase in a MapReduce request. This is a valid solution, but makes the map function painfully more aware of the environment in which it’s running.
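
To make that concrete, here is a hedged sketch of what such a request body might look like, with each phase getting its own argument object. The bucket name and the abbreviated "source" strings are placeholders, not actual code; Riak passes each phase's arg as the third parameter to the map function.

```javascript
// Sketch only: bucket name and "source" strings are placeholders.
var request = {
  "inputs": "logs",
  "query": [
    { "map": { "language": "javascript",
               "source": "/* extractByMatchingAttribute from above */",
               "arg": { "last": false } } },  // returns bucket/key pairs
    { "map": { "language": "javascript",
               "source": "/* a second map matching another attribute */",
               "arg": { "last": true } } }    // returns data for the client
  ]
};
```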

“But hey,” you wonder, “couldn’t I just make these into a single map function instead and return the value immediately?” You could, but depending on other MapReduce scenarios you could end up repeating a lot of code across a lot of functions. Once again, you could extract the shared code, ignore the duplication, or try to abstract it away by making map functions chainable through extra arguments, which can then be passed to every single map phase independently.

I told you, MapReduce is hard because it’s just code, didn’t I?

MapReduce is hard because it doesn’t (yet) have a common language

Wouldn’t it be nice if we could just use a simple language to express what we want our Riak cluster to do instead? Something like the following, where we have a bucket full of log entries, every one of them represented by a JSON object, which handily includes a host attribute:

```sql
LOAD log_entries FROM KEY '2011-04-08' TO KEY '2011-05-10' INTO entries
FILTER entries BY host MATCHING /basho.com/ AND
GROUP entries BY host
```

There, three lines of easily readable text in a purely hypothetical example, although slightly inspired by Apache Pig. “Oh hey,” you’re saying, “doesn’t that look a bit like SQL?” Maybe it does, maybe it doesn’t, but that’s not the point. More important than that, it’s easier to read.

Will we see something along these lines in Riak? No comment.

MapReduce is hard because accessing lots of data is expensive

MapReduce is the technique du jour to query and filter data in Riak. It’s intended to be used as a tool to transform and analyze a set of data. You hand in a bunch of bucket/key pairs, and Riak feeds them into your map and reduce functions. Simple enough.

It has been our experience, however, that what users want is to feed an entire bucket, or a whole range of keys, into a MapReduce request. Thankfully you can do the latter using the rather nice key filters.

Still, fetching all the keys from a bucket requires a lot of resources. Key filters can certainly help to reduce that cost, because they fold the list of keys before the data is loaded from disk, which is faster than loading everything right away. But it still takes a lot of time when there are millions of keys stored in a Riak cluster, not to mention the fact that every node in the cluster is involved both in fetching all objects from a bucket and in running key filters.
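
For illustration, here is a minimal sketch of what such an input specification could look like, assuming log entries are keyed by an ISO date and using the starts_with key-filter predicate; the bucket name and key format are assumptions.

```javascript
// Assumes log entries are keyed by an ISO date prefix, e.g. "2011-04-08...".
var inputs = {
  "bucket": "log_entries",
  "key_filters": [["starts_with", "2011-04"]] // only keys from April 2011
};
```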

If this is what you’re trying to achieve, you should have a look at Riak Search. It does a much better job at these reverse lookup queries (which some call secondary indexes), and you can neatly feed the results into a MapReduce request, something we’ve implemented in Riaktant, our little syslog collector and aggregator. But we’ll leave dissecting that code to another blog post.

We have some more tricks up our sleeve which should be appearing in the near future, so stay tuned. I’m sure it’ll blow your mind just like it blew mine.

MapReduce is hard because going distributed requires coordination

Whenever you fire off a MapReduce request in a Riak cluster, the node you send it to becomes the coordinator of this particular request. This poor node suddenly has the responsibility of sending your map request to all the relevant nodes in the cluster, using the preference list where applicable.

The coordinating node then has the job of sending out the map requests and waiting for them to return, triggering the timeout if they don’t. If a map phase returns bucket/key pairs and there’s another map phase to run, it starts over, sending new requests to the nodes responsible for those pairs.

Map phases in a Riak cluster always run close to the data to reduce overhead, both in network traffic and in general load on the cluster: it greatly reduces the amount of data sent across the wire, and it lets the nodes responsible for the data get some of that MapReduce action as well.

That’s a lot of work for the coordinating node, but except for the reduce part, it’s not necessarily heavy computationally. It does, however, have to track the state of all the phases sent out to the cluster, collect the results, and handle failures of even a single node in the cluster.

What can you do about this? Other than keeping the initial data set for a MapReduce request as small as possible, the ball is pretty much in our court. It’s Riak’s task to make sure a MapReduce request is handled and coordinated properly.

MapReduce is hard because reduce runs recursively

Enter the brain bender called re-reduce. If you haven’t run into it, you’re in for a treat. A reduce function returns a list of aggregation results based on a list of inputs. Depending on how large the result set is going to be, Riak’s MapReduce may feed the function’s own first round of results back into it, a technique commonly called re-reduce.

In general, your reduce functions should be oblivious to the kind of data they’re fed; they should work properly in both the normal and the re-reduce case. You have two options here:

  • Make your reduce functions self-aware.
    They know what kind of data they get from a previous map phase, and what kind of data they return. But that means adding some sort of type checking to your reduce functions, which is cumbersome and, especially with JavaScript, neither exactly beautiful nor always achievable. So let’s assume I never told you about this option.
  • Make sure they always return data in the format they receive it.
    This is the way you’ll usually end up going. It’s a lot easier to spend a little more time thinking about how to build a result that the same function can safely run on again, so that the output is either unchanged or simply aggregated further in the same fashion as before.

Here’s an example: a function that groups by attribute. It assumes a list of JavaScript objects with attributes as keys and numbers as values, e.g. {attribute: 1}. It simply adds up the values for every attribute, so that a list containing three copies of the example above turns into {attribute: 3}. This makes it nicely ignorant of where the data comes from, and re-reduce will simply be fed the same data structures as the initial reduce.

The second parameter to the function is an argument you can pass to the reduce phase, telling it which particular attribute you’re interested in and making the function nicely reusable along the way.

```javascript
function(values, field) {
  // Sum the numeric values per attribute across all inputs. The optional
  // `field` argument (the reduce phase's arg) restricts the aggregation
  // to a single attribute.
  var result = {};
  for (var i = 0; i < values.length; i++) {
    var obj = values[i];
    for (var key in obj) {
      if (field && key !== field) {
        continue;
      }
      if (key in result) {
        result[key] += obj[key];
      } else {
        result[key] = obj[key];
      }
    }
  }
  return [result];
}
```
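
As a usage sketch, the attribute of interest would be handed to the phase via its arg; the attribute name "host" and the inlined source reference are illustrative only.

```javascript
var phase = {
  "reduce": { "language": "javascript",
              "source": "/* the grouping function above */",
              "arg": "host" }  // arrives as the second parameter, `field`
};
```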

Re-reduce is a weird concept to grasp, but when you just think of it as two different ways Riak will pass data to your reduce function, it’s actually not that hard. Just make sure a reduce function returns data in a way it can run on again, and you’re fine.

MapReduce is hard because debugging errors is hard

Finding out where things go wrong in a distributed MapReduce environment like Riak is hard. The only really useful tool you have for inspecting data is the ejsLog() function in JavaScript, and that means you’ll have to go look on every node in the cluster to find out what went wrong or whether the data is in the format you expect. The following dumps the list of objects handed to a map or reduce function as JSON into the log file /tmp/mapreduce.log.

```javascript
ejsLog('/tmp/mapreduce.log', JSON.stringify(values));
```
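
One way to use it, sketched here with an illustrative function body, is to drop the call into the function you're debugging so every node logs exactly what it was handed:

```javascript
var inspectingMap = function(value, keyData, arg) {
  // Dump the raw input object before doing any real work.
  ejsLog('/tmp/mapreduce.log', JSON.stringify(value));
  return [[value.bucket, value.key]];
};
```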

The small fix: test your MapReduce functions on a cluster with a single node before cranking it up a notch, or have a look at Sean Cribbs’ riak-qc.js framework for testing Riak MapReduce functions.

The MapReduce bottom line

MapReduce is hard, but we’re not going shopping just yet. Instead, we’re keeping a close eye on what our users are trying to achieve, improving Riak along the way. For more details on Riak’s MapReduce, check out the wiki, read a blog post on more practical MapReduce, or watch last year’s webinar on MapReduce. If you’re feeling adventurous, have a look at the examples and MapReduce code used in Riaktant. I’ll be sure to get back to you with a proper introduction on how it uses MapReduce to sift through syslog data.

Mathias

A Deeper Look At Riak's MapReduce Enhancements

January 6, 2011

We officially released Riak 0.14 yesterday. Some of the biggest enhancements were in and around Riak’s MapReduce functionality. Here’s a more in-depth look at what you can look forward to in 0.14 if you’re into Mapping and Reducing.

Key Filtering

Performing any type of sophisticated query on a strictly key/value store is notoriously hard. Past releases of Riak were limited to MapReduce-ing over either entire buckets or a discrete set of user-supplied inputs. The problem with these approaches is that neither facilitates the kind of robust querying many applications require. For example, let’s examine an application which logs application events to a Riak bucket. Each entry is a JSON hash containing a timestamp, the user generating the event, and some information about the event. Part of our example application requires querying these log entries based on timestamp and user.

Without key filtering, the application would have to map over the entire bucket and examine each entry to find the relevant set. This type of query is usable when the bucket is small, but as the bucket grows these queries begin to exhibit performance problems. Scanning the bucket and loading objects from disk only to discard them is an expensive proposition.

This is exactly the use case where key filtering can help. If the application can store meaningful data in the keys then key filtering can query just the keys and load only the objects whose keys pass the filter to be processed by the MapReduce job. For our example app we could combine the timestamp and user id to form each entry’s key like this: “1292943572.pjohnson”. Using key filters we can locate all the user entries for “pjohnson” between “12/1/2010” and “12/7/2010” and count them via MapReduce:

```javascript
{"inputs": {
   "bucket": "user_logs",
   "key_filters": [["and", [["tokenize", ".", 1],
                            ["string_to_int"],
                            ["between", 1291161600, 1291852799]],
                           [["tokenize", ".", 2],
                            ["matches", "pjohnson"]]]]},
 "query": [{"map": {
              "language": "javascript",
              "source": "function(obj) { return [1]; }"}},
           {"reduce": {
              "language": "javascript",
              "name": "Riak.reduceSum"}}]}
```

Key filtering will support boolean operators (and, or, not), url decoding, string tokenizing, regular expressions, and various string to numeric conversions. Client support will initially be limited to the Java and Ruby clients (and Python support is already being attacked by the community). More clients will be added in subsequent releases.

Next Steps

MapReduce Query Planner

One of the biggest obstacles to improving Riak’s MapReduce performance was the way map functions were scheduled around the cluster. The original implementation was fairly naive and scheduled map functions around the vnodes in the order listed in the replica list for each bucket/key pair. This approach resulted in vnode hotspots, especially in smaller clusters, as many bucket/key pairs would hash to the same vnode. We also sent each bucket/key pair to be mapped in a separate Erlang message which reduced throughput on larger jobs as they wound up generating significant messaging traffic.

The new planner addresses many of these problems. Each batch of 50 bucket/key pairs is analyzed and scheduled around the cluster to maximize vnode coverage. In other words, the planner schedules many bucket/key pairs onto a common vnode in a single message. This reduces the chattiness of jobs overall and also improves throughput, as the underlying map dispatch code can operate on batches rather than single values.

Segregated Javascript VM Pools

Contention for Javascript VMs in a busy cluster can be a significant source of performance problems. The contention is caused by each cluster node having a single pool of Javascript VMs for all Javascript calls: map functions, reduce functions, and pre-commit hooks.

0.14 supports three separate pools of Javascript VMs to reduce overall contention. By tweaking a few lines in your app.config file, you can tailor the size of each pool to your particular needs. Does your app use MapReduce and ignore hooks? Turn the hook pool size down to zero and save yourself some CPU and memory. Do you always submit MapReduce jobs to a particular node in the cluster? You can bump up the reduce pool size on the node receiving the jobs while setting it to zero on the other nodes. This takes advantage of the fact that reduce phases aren’t distributed, letting you put resources where they are most needed in the cluster.

As you can see, we’ve put a lot of work into refining MapReduce in the latest release, and we’re dedicated to continuing this work in upcoming releases. If you want to get your hands dirty with MapReduce right now, check out:

Enjoy!

The Basho Team

From Relational to Riak (Webcast)

January 02, 2013

New to Riak? Thinking about using Riak instead of a relational database? Join Basho chief architect Andy Gross and director of product management Shanley Kane for an intro this Thursday (11am PT/2pm ET). In about 30 minutes, we’ll cover the basics of:

* Scalability benefits of Riak, including an examination of limitations around master/slave architectures and sharding, and what Riak does differently
* A look at the operational aspects of Riak and where they differ from relational approaches
* Riak’s data model and benefits for developers, as well as the tradeoffs and limitations of a key/value approach
* Migration considerations, including where to start when migrating existing apps
* Riak’s eventually consistent design
* Multi-site replication options in Riak

Register for the webcast [here](http://info.basho.com/RelationalToRiakJan3.html).

[Shanley](http://twitter.com/shanley)

[Andy](https://twitter.com/argv0)

Basho is Taking Over Baltimore This Weekend

September 29, 2010

Basho hackers will be giving quite a few presentations between now and the end of the week. And they all happen to be in Baltimore! Here is a quick rundown (in no particular order) of where we will be, who will be there, and what we will be talking about:

Rusty Klophaus at CUFP

Rusty Klophaus will be at the Commercial Users of Functional Programming (CUFP) Event taking place this weekend in Baltimore, Maryland. His talk is called “Riak Core: Building Distributed Applications Without Shared State” and it should be downright amazing.

From the talk’s description: Both Riak KV (a key-value datastore and map/reduce platform) and Riak Search (a Solr-compatible full-text search and indexing engine) are built around a library called Riak Core that manages the mechanics of running a distributed application in a cluster without requiring a central coordinator or shared state. Using Riak Core, these applications can scale to hundreds of servers, handle enterprise-sized amounts of data, and remain operational in the face of server failure.

All the details on his talk can be found here.

And, in case you haven’t been following your Riak Core developments, check out Building Distributed Systems with Riak Core.

Dave Smith at the Ninth ACM SIGPLAN Erlang Workshop

Dave Smith (a.k.a “Dizzyd”) will be keynoting the ACM SIGPLAN Erlang Workshop taking place on Thursday, Sept 30th, also in Baltimore. Dave’s talk is called “Rebar, Bitcask and how chemotherapy made me a better developer.”

Rebar and Bitcask are both pieces of software that Dizzy had a major hand in creating and they have played a huge role in Riak’s adoption (not to mention that Rebar has quickly become an indispensable tool for Erlang developers everywhere). Dave was also fighting follicular lymphoma while writing a lot of this code. Needless to say, this one is sure to be memorable and of immense value.

More details on his talk can be found here.

It should also be noted that newly-minted Basho Developer Scott Fritchie is the Workshop Chair for this event. He is an accomplished Erlang developer and Riak is not the only distributed key/value store about which Scott is passionate – he will also happily talk your ear off about Hibari.

Justin Sheehy at Surge

Just when you thought they couldn’t fit another conference in Baltimore on the same weekend… And this one is big: it’s the Surge Conference put on by the team at OmniTI. Basho will be there in the form of CTO Justin Sheehy.

Justin will be giving a talk about concurrency at scale, something about which every distributed systems developer should care deeply. Additionally, he will be taking part in a panel discussion – I haven’t seen an official name for it yet but rumor has it that it’s something along the lines of “SQL versus NoSQL.”

Check out the Surge site for more conference details.

As you can see, Baltimore is the place to be this weekend. Get there at all costs. And then go download Riak.

Mark

Free Webinar – Riak in Action – Wriaki – August 19 at 2PM

August 13, 2010

Documentation is great, but playing with examples can also be a helpful way to tackle steep learning curves. To help you learn about ways of using Riak, we’d like to present “Wriaki”, an example implementation of a wiki that stores its data in Riak.

We invite you to join us for a free webinar on Thursday, August 19 at 2:00PM Eastern Time (UTC-4) about Riak in Action: Wriaki. During the presentation, Bryan Fink will cover:

  • Modeling wiki data in the Riak key/value store
  • Access patterns using both get/put and map/reduce
  • Three strategies that Wriaki uses for dealing with eventual consistency
  • How the user interface changes to accommodate Wriaki’s models

The code for Wriaki will be open-sourced at the time of the presentation. The presentation will last 30 to 45 minutes, with time for questions at the end. Registration for this webinar has since closed.

If you cannot attend, the video and slides will be made available afterward in the recap post on the blog.

Webinar Recap – Schema Design for Riak

July 7, 2010

Thank you to all who attended the webinar yesterday. The turnout was great, and the questions at the end were also very thoughtful. Since I didn’t get to answer very many, I’ve reviewed the questions below, in no particular order. If you want to review the slides from yesterday’s presentation, they’re on Slideshare.

Q: You say listing keys is expensive. How are Map phases affected? Does the number of keys in a bucket have an effect on the expense of the operation? (paraphrased)

Listing keys (for a single bucket, there is no analog for the entire system) requires traversing the entire keyspace, even examining keys that don’t belong to the requested bucket. If your Map/Reduce query uses a whole bucket as its inputs, it will be nearly as expensive as listing keys back to the client; however, Map phases are executed in parallel on the nodes where the data lives, so you get the full benefits of parallelism and data-locality when it executes. The expense of listing keys is taken before any Map phase begins.

It bears reiterating that the expense of listing keys is proportional to the total number of keys stored (regardless of bucket). If your bucket has only 10 keys and you know what they are, it will probably be more efficient to list them as the inputs to your Map/Reduce query than to use the whole bucket as an input.
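
A small sketch of the two input styles (bucket and key names are made up); naming the keys you already know avoids the key-listing cost entirely:

```javascript
var wholeBucket = "users";                   // forces a walk over the entire keyspace
var explicitInputs = [["users", "sean"],
                      ["users", "mark"],
                      ["users", "justin"]];  // no key listing needed
```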

Q: How do you recommend modeling relationships that require a large number of associations (thousands or millions)?

This is difficult to do, and I won’t say there’s an easy or best answer. One idea that came up in the IRC room after the webinar was building a B-tree-like data structure that could be grown to fit the number of associations. This solves the one-to-many relationship, but will require extra handling and care on the part of your application. In some cases, where you only need to know membership in the relationship, a bloom filter might be appropriate. If you must model lots of highly-connected data, consider throwing a graph database into the mix. Riak is not going to fit all use-cases; some models will be awkward.

Q: My company provides a Java web application and analytics solution that uses JDO to persist to and query from a relational database. Where would I start in integrating with Riak?

Since I haven’t done Java in a serious way for a long time, I can’t speak to the specifics of JDO, or how you might work on migrating away from it. However, I have found that most ORMs hide things from the developer that he/she should really be aware of — how the mapping is performed, what queries are executed, etc. You’ll likely have to look into the guts of how JDO persists and retrieves objects from the database, then step back and reevaluate what your top queries are and how Riak can help improve or simplify those operations. This is all in the theme of the webinar: Know your data!

Q: Is the source code for the example application and schema design available? (paraphrased)

No, there isn’t any sample code yet. You can play with the existing application (Lowdown) at lowdownapp.com. The other authors and I are seeking a few people to take over its development, and the initial group we contacted have indicated it will be open-sourced.

Q: Is there a way to get notified of changes in a bucket?

That’s not built-in to Riak. However, you could write a post-commit hook in Erlang that pushes a notification to RabbitMQ, for example, then have the interested parties consume messages from that queue.

Q: What mechanism does Riak have to deal with the unique user issue?

Riak has neither write locks nor transactions. There is no way to absolutely guarantee uniqueness without introducing an intermediary that acts as a single-arbiter (and point-of-failure). However, in cases when you aren’t experiencing high write-concurrency on the data in question there are a few things you can do to simulate the uniqueness constraint:

  • Check for existence of the key before writing. In HTTP, this is as simple as a HEAD request. If the response is 404 Not Found, the object probably doesn’t exist.
  • Use a conditional PUT (in HTTP) when creating the object. The If-None-Match: * header should prevent you from blindly overwriting an existing key.

Neither of these solutions is bullet-proof, because all operations in Riak happen asynchronously. Remember that it’s eventually consistent, meaning that not all parts of the system may agree at all times, but they will converge on a single state over time. There will be corner cases where a key doesn’t exist when you check for it, the write via the conditional request succeeds, and you still end up creating an object in conflict. Caveat emptor.
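
For illustration only, here is a minimal sketch of both checks against Riak's HTTP interface; the URL, bucket, and key are assumptions, and since neither step is atomic this merely narrows the window for conflicts:

```javascript
var url = 'http://localhost:8098/riak/users/alice'; // assumed bucket/key

async function createIfAbsent(doc) {
  // 1. Cheap existence check: 404 means the key probably doesn't exist yet.
  var head = await fetch(url, { method: 'HEAD' });
  if (head.status !== 404) {
    return false; // someone already wrote this key
  }
  // 2. Conditional PUT: If-None-Match: * asks Riak to refuse the write
  //    if an object already exists under this key (412 if it does).
  var put = await fetch(url, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json', 'If-None-Match': '*' },
    body: JSON.stringify(doc)
  });
  return put.ok;
}
```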

Q: Are the intermediate results of Link and Map phases cached?

Yes, the results of both map and link phases are cached in a pretty naive LRU. The development team has plans to improve its behavior in future versions of Riak.

Q: Could you comment on commit hooks and what place they have, if any, in riak schema design? Would it make sense to use hooks to build an index e.g. keys in a bucket?

Yes, commit hooks are very useful in schema design. For example, you could use a pre-commit hook to validate the format of data before it’s stored. You could use post-commit hooks to send the data to external services (see above) or, as you suggest, build an index in another bucket. Building a secondary index reliably is complicated though, and it’s something I want to work on over the next few months.
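
As a sketch of the validation case (the hook name is arbitrary, and the object layout mirrors what map functions see): a JavaScript pre-commit hook returns the object to allow the write, or an object with a fail key to reject it.

```javascript
function validateJson(object) {
  try {
    // The stored value lives in the first entry of the object's values list.
    JSON.parse(object.values[0].data);
    return object;                                // let the write through
  } catch (e) {
    return { "fail": "value is not valid JSON" }; // reject the write
  }
}
```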

Q: So if you have allow_mult=false are there cases where riak will return a conflict 409? Is the default that last write wins?

Riak never returns a 409 Conflict status from the HTTP interface on writes. If you supply a conditional header (If-Match, for example) you might get a 412 Precondition Failed response if the ETag of the object to be modified doesn’t match the header. In general, it is Riak’s policy to accept writes regardless of the internal state of the object.

The “last write wins” behavior comes in two flavors: “clobbering” writes, and softer “show me the latest one” reads. The latter is the default behavior, in which siblings might occur internally (and the vector clock grows) but are not exposed to the client; instead Riak returns the sibling with the latest timestamp at read/GET time and “throws away” new writes that are based on older (ancestor) vclocks. The former actually ignores vector clocks for the specified bucket, providing no guarantees of causal ordering of writes. To turn this behavior on, set the last_write_wins bucket property to true. Except in the most extreme cases, where you don’t mind clobbering things that were written since the last time you read, we recommend using the default behavior. If you set allow_mult=true, conflicting writes (objects with divergent vector clocks, neither a descendant of the other) will be exposed to the client with a 300 Multiple Choices response.
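
For reference, a small sketch of flipping that switch over the HTTP interface (the URL and port assume a default local node):

```javascript
fetch('http://localhost:8098/riak/users', {
  method: 'PUT',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ props: { allow_mult: true } })
}).then(function (res) {
  console.log(res.status); // expect 204 No Content on success
});
```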

Again, thanks for attending! Look for our next webinar in about two weeks.

Sean

Free Webinar – Schema Design for Riak – July 8th at 2PM Eastern

June 30, 2010

Moving applications to Riak involves a number of changes from the status quo of RDBMS systems, one of which is taking greater control over your schema design. You’ll have questions like: How do you structure data when you don’t have tables and foreign keys? When should you denormalize, add links, or create map-reduce queries? Where will Riak be a natural fit and where will it be challenging?

We invite you to join us for a free webinar on Thursday, July 8 at 2:00PM Eastern Time to talk about Schema Design for Riak. We’ll discuss:

  • Freeing yourself of the architectural constraints of the “relational” mindset
  • Gaining a fuller understanding of your existing schema and its queries
  • Strategies and patterns for structuring your data in Riak
  • Tradeoffs of various solutions

We’ll address the above topics and more as we design a new Riak-powered schema for a web application currently powered by MySQL. The presentation will last 30 to 45 minutes, with time for questions at the end.

Fill in the form below if you want to get started building applications on top of Riak!

Sorry, registration is closed.

The Basho Team

Introducing the Riak Fast Track

May 4, 2010

Our Challenge

There is nothing easy about making software simple to learn and understand. Every potential user learns a little differently, and that makes for a hard road to simple usage. This is especially true with Riak.

Internally at Basho, we are constantly addressing questions like, “How do we make a ‘distributed, Dynamo-inspired key/value store’ inviting and less daunting to first time users?” and “How do we lower the barrier to adoption and usage?” Though resources like the Riak Mailing List, the Riak Wiki, and Riak IRC channel are great, we kept asking ourselves, “What can we do to make it dead simple for those new to and interested in Riak to learn about it and how it works?”

Our answer (in part) is the Riak Fast Track.

What is the Riak Fast Track?

The Fast Track is an interactive module on the Riak Wiki that, through a combination of concise content and brief screencasts, will bring you up to speed on a) what Riak is, b) what its key features and benefits are, and c) how to use it.

As I stated above, the Fast Track is aimed at developers who may be new to Riak or those who may have heard about it in passing but haven’t spent too much time fiddling with it.

Is it exhaustive? No. Will you be a Riak expert after an hour? No. But, at the end of it, you should be able to tell your friends that you performed a JavaScript MapReduce query on historical stock data distributed over a three-node Riak cluster on your local machine. If that’s not cool then I don’t know what is!

Your Challenge

We put a lot of time into making this, but there are undoubtedly some kinks that need to be worked out. And, regardless of how long we try to tweak and refine it, there will always be some small aspects and details that we aren’t going to get right. It is for that reason that we are appealing to you, the rapidly-growing Riak Community, to help us.

So, here is the challenge: Take 45 minutes and go through the Riak Fast Track. Then, when you’re done, take five minutes to write us an email and tell us what you thought about it. That’s it.

We are looking for answers to questions like:

  • Was it effective?
  • Did you learn anything?
  • What did we get right?
  • What did we get wrong?
  • What should we add/remove?

And, to sweeten the pot, we are going to send a “Riak Swag Pack” (contents of which are top secret) to everyone who sends us their review and thoughts on the Fast Track by the close of business on Tuesday (5/11) of next week. It doesn’t have to be anything extensive (though we love details). A simple, “I liked x, y, and z, but you could have done this better” would suffice. You can send your emails to mark@basho.com. I am looking forward to hearing from you!

So, without further ado, go forth and test out the Riak Fast Track.

We hope you’ll find it useful and we’re looking forward to your thoughts on how to make it better.

Best,

Mark Phillips

Hello, Bitcask

April 27, 2010

because you needed another local key/value store

One aspect of Riak that has helped development move so quickly is pluggable per-node storage. By allowing nearly anything k/v-shaped to be used for actual persistence, Riak lets progress on storage engines happen in parallel with progress on the higher-level parts of the system.

Many such local key/value stores already exist, such as Berkeley DB, Tokyo Cabinet, and Innostore.

There are many goals we sought when evaluating which storage engines to use in Riak, including:

  • low latency per item read or written
  • high throughput, especially when writing an incoming stream of random items
  • ability to handle datasets much larger than RAM w/o degradation
  • crash friendliness, both in terms of fast recovery and not losing data
  • ease of backup and restore
  • a relatively simple, understandable (and thus supportable) code structure and data format
  • predictable behavior under heavy access load or large volume
  • a license that allowed for easy default use in Riak

Achieving some of these is easy. Achieving them all is less so.

None of the local key/value storage systems available (including but not limited to those written by us) were ideal with regard to all of the above goals. We were discussing this issue with Eric Brewer when he had a key insight about hash table log merging: that doing so could potentially be made as fast or faster than LSM-trees.

This led us to explore some of the techniques used in the log-structured file systems first developed in the 1980s and 1990s in a new light. That exploration led to the development of bitcask, a storage system that meets all of the above goals very well. While bitcask was originally developed with a goal of being used under Riak, it was also built to be generic and can serve as a local key/value store for other applications as well.

If you would like to read a bit about how it works, we’ve produced a short note describing bitcask’s design that should give you a taste. Very soon you should be able to expect a Riak backend for bitcask, some improvements around startup speed, information on tuning the timing of merge and fsync operations, detailed performance analysis, and more.

In the meantime, please feel free to give it a try!

- Justin and Dizzy

Schema Design in Riak – Relationships

March 25, 2010

In the previous installment we looked at how your reasons for picking Riak affect how your schema should be designed, and how you might go about structuring your data at the individual object level. In this post we’ll look at how to design relationships on top of Riak.

Relationships? I thought Riak was key-value.

Even a mildly complicated application is going to have more than one type of data to store and manipulate. Those data are not islands, but have relationships to one another that make your application and its domain more than just arbitrary lists of things.

Yes, at its core, Riak is a key-value store or distributed hash-table. Because key-value stores are not very sophisticated at modeling more complicated relationships, Riak adds the concept of links between objects that are qualified by “tags” and can be easily queried using “link-walking”.

Now, the knee-jerk reaction would be to start adding links to everything. I want to show you that the problem of modeling relationships is a little more nuanced than just linking everything together, and that there are many ways to express the same relationship — each having tradeoffs that you need to consider.

Key correspondence

The easiest way to establish a relationship is to have some correspondence between the keys of the items. This works well for one-to-one and some one-to-many relationships and is easy to understand.


In the simplest case, your related objects have the same key but different buckets. Lookups on this type of relationship are really efficient: you just change the bucket name to find the other item. How is this useful? Why not just store them together? One of the objects may get updated or read more often than the other, or their data types might be incompatible (a user profile and its avatar, for example). In either case, you get the benefit of the separation and fast access without needing link-walking or map-reduce; however, you really can only model one-to-one relationships with this pattern.

For one-to-many types of relationships, you might prefix or otherwise derive the key of the dependent (many) side of the relationship with the key of the parent side. This could be done as part of the bucket name, or as a simple prefix to the key. There are a couple of important tradeoffs to consider here. If you choose the bucket route, the number of buckets might proliferate in proportion to your data quantity. If you choose to prefix the key, it will be easy to find the parent object, but may be more difficult to find the dependent objects. The same reasons as having equivalent keys apply here — tight cohesion between the objects but different access patterns or internal structure.
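
A tiny sketch of both patterns, with made-up bucket and key names:

```javascript
// One-to-one: same key, different buckets.
var profileUrl = 'http://localhost:8098/riak/users/seancribbs';
var avatarUrl  = 'http://localhost:8098/riak/avatars/seancribbs';

// One-to-many: dependent keys derived from the parent key.
var homeAddressKey = 'seancribbs_home'; // parent key plus a label
var workAddressKey = 'seancribbs_work';
```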

De-normalization / Composition

A core principle in relational schema design is factoring your relations so that they achieve certain “normal forms”, especially for one-to-many sorts of relationships. This means that if your domain concept “has” any number of something else, you’ll make a separate table for that thing and insert a foreign key that points back to the owner. In Riak, by contrast, de-normalizing (or composing) your data often makes sense, both for the sake of performance and for ease of modeling.

How does this work? Let’s say your relational database had tables for people and for addresses. A person may have any number of addresses for home, work, mailing, etc, which are related back to the person by way of foreign key. In Riak, you would give your person objects an “addresses” attribute, in which you would store a list or hash of their addresses. Because the addresses are completely dependent on the person, they can be a part of the person object. If addresses are frequently accessed at the same time as the person, this also results in fewer requests to the database.
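
A sketch of what that composed person object might look like (field names are illustrative):

```javascript
var person = {
  name: "Jane Doe",
  addresses: {
    home:    { street: "100 Main St",   city: "Akron",   state: "OH" },
    work:    { street: "500 Market St", city: "Chicago", state: "IL" },
    mailing: { street: "PO Box 12",     city: "Akron",   state: "OH" }
  }
};
```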

Composition of related data is not always the best answer, even when a clear dependency exists; take for instance, the Twitter model. Active users can quickly accrue thousands of tweets, which need to be aggregated in different combinations across followers’ timelines. Although the tweet concept is dependent on the user, it has more conceptual weight than the user does and needs to stand by itself. Furthermore, performance would suffer if you had to pull all of a user’s tweets every time you wanted to see their profile data.

Good candidates for composition are domain concepts that are very dependent on their “owner” concept and are limited in number. Again, knowing the shape of your data and the access pattern are essential to making this decision.

Links

Links are by far the most flexible (and popular) means of modeling relationships in Riak, and it’s easy to see why. They hold the promise of giving a loose, graph-like shape to your relatively flat data and can cleanly represent any cardinality of relationship. Furthermore, link-walking is a really attractive way to quickly do queries that don’t need the full power of map-reduce (although Riak uses map-reduce behind the scenes to traverse the links). To establish a relationship, you simply add a link on one object pointing to the other.
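
Here is a hedged sketch of what that looks like over the HTTP interface (bucket, key, and tag names are made up, and the paths assume the default /riak prefix): the link rides along as a Link header when you store the object, and link-walking follows it with a single GET.

```javascript
fetch('http://localhost:8098/riak/users/sean', {
  method: 'PUT',
  headers: {
    'Content-Type': 'application/json',
    // Qualify the link with a tag so it can be walked selectively.
    'Link': '</riak/users/mark>; riaktag="follows"'
  },
  body: JSON.stringify({ name: 'Sean' })
}).then(function () {
  // Follow all "follows" links from users/sean: bucket,tag,keep.
  return fetch('http://localhost:8098/riak/users/sean/users,follows,1');
});
```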

Intrinsically, links have no notion of cardinality; establishing that is entirely up to your application. The primary difference is whether changing an association replaces or adds/removes links from the associated objects. Your application will also have to do some accounting about which objects are related to other objects, and establish links accordingly. Since links are uni-directional, stored on the source, and incoming links are not automatically detected, your application will need to add the reciprocal links when traversals in both directions are needed (resulting in multiple PUT operations). In some cases, especially in one-to-many relationships where the “many” side is not accessed independently, you might not need to establish the reciprocal link. Knowing how your data will be accessed by the application — both reads and writes — will help you decide.

Links have a few other limitations that you will need to consider. First, although the tag part of the link can technically be any Erlang term, using anything other than a binary string may make it difficult for HTTP-based clients to deal with them. Second, since links are stored directly with the object in its metadata, objects that have many links will be slower to load, store, and perform map-reduce queries over. In the HTTP/REST interface as well, there are practical limitations simply because of the method of transport. At the time of writing, mochiweb — the library that is the foundation of webmachine, Riak’s HTTP interface — uses an 8K buffer for incoming requests and limits the request to 1000 header fields (including repeated headers). This means that each Link: header you provide needs to be less than 8K in length, and assuming you use the typical headers when storing, you can have at most about 995 individual Link: headers. By the time you reach the roughly 150,000 links that those headers allow, you’ll probably want to consider other options anyway.

Hybrid solutions

At this point, you might be wondering how your data is going to fit any of these individual models. Luckily, Riak is flexible, so you can combine them to achieve a schema that best fits your needs. Here are a few possibilities.

Often, either the number of links on an object grows large or the need to update them independently of the source object arises. In our Twitter example, updating who you follow is a significantly different operation from updating your user profile, so it makes sense to store those separately, even though they are technically a relationship between two users. You might have the user profile object and list of followed users as key-correspondent objects, such as users/seancribbs and following/seancribbs (not taking into account your followers, of course).

In relational databases you typically use the concept of a “join table” to establish many-to-many relationships. The intermediary table holds foreign keys back to the associated objects, and each row represents one individual association, essentially an “adjacency list”. As your domain becomes more complex and nuanced, you might find that these relationships represented by join tables become domain concepts in their own right, with their own attributes. In Riak, you might initially establish many-to-many relationships as links on both sides. Similarly to the “join table” issue, the relationship in the middle might deserve an object of its own. Some examples that might warrant this design: qualified relationships (think “friends” on Facebook, or permissions in an ACL scheme), soft deletion, and history (tracking changes).

Key correspondence, composition and linking aren’t exclusive ways to think of relationships between data in your application, but tools to establish the semantics your domain requires. I’ve said it many times already, but carefully evaluate the shape of your data, the semantics you want to impose on it, and the operational profile of your application when choosing how you structure your data in Riak.

Sean Cribbs