December 8, 2010

Thank you to all who attended the webinar yesterday. The turnout was great, and the questions at the end were also very thoughtful. Since I didn’t get to answer very many, I’ve reviewed all of the questions below, in no particular order.

Q: Can you touch on upcoming filtering of keys prior to map reduce? Will it essentially replace the need for one to explicitly name the bucket/key in a M/R job? Does it require a bucket list-keys operation?

Key filters, in the upcoming 0.14 release, will allow you to logically select a population of keys from a bucket before running them through MapReduce. This will be faster than a full-bucket map since it only loads the objects you’re really interested in (the ones that pass the filter). It’s a great way to make use of meaningful keys that have structure to them. So yes, it does require a list-keys operation, and no, it doesn’t replace the need to be explicit about which keys to select; there are still many useful queries that can be done when the keys are known ahead of time.

For more information on key-filters, see Kevin’s presentation on the upcoming MapReduce enhancements.
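
As a rough illustration of the idea, here is a minimal Python sketch of a MapReduce job description that uses a key filter. It assumes the JSON `key_filters` syntax as I understand it (a list of transform and predicate steps); the bucket name `invoices` and the `<customer>-<date>` key layout are made up for the example, and `passes_filter` is only a local stand-in that shows which keys the filter would select, not part of Riak.

```python
import json

# Hypothetical bucket whose keys look like "<customer>-<yyyymmdd>".
job = {
    "inputs": {
        "bucket": "invoices",
        "key_filters": [
            ["tokenize", "-", 1],  # split the key on "-" and keep the first token
            ["eq", "basho"],       # keep only keys whose first token is "basho"
        ],
    },
    "query": [
        # Built-in JavaScript map function that parses each value as JSON.
        {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}},
    ],
}


def passes_filter(key):
    """Local imitation of the filter above, for illustration only."""
    return key.split("-")[0] == "basho"


# Only the keys that pass the filter would have their objects loaded.
keys = ["basho-20101201", "basho-20101215", "acme-20101203"]
selected = [k for k in keys if passes_filter(k)]

# This JSON string is what a client would submit as the job body.
body = json.dumps(job)
```

The filter only works this well because the keys carry meaning; with opaque keys there is nothing to tokenize or compare.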

Q: How can you validate that you’ve reached a good/valid KV model when migrating a relational model?

The best way is to try out some models. The thing about schema design for Riak that turns your process on its head is that you design for optimizing queries, not for optimizing the data model. If your queries are efficient (single-key lookup as much as possible), you’ve probably reached a good model, but also weigh things like payload size, cost of updating, and difficulty manipulating the data in your application. If your design makes it substantially harder to build your application than a relational design, Riak may not be the right fit.

Q: Are there any “gotchas” when thinking of a bucket as we are used to thinking of a table?

Like tables, buckets can be used to group similar data together. However, buckets don’t automatically enforce data structure (columns with specified types, referential integrity) like relational tables do; that part is still up to your application. You can, however, add precommit hooks to buckets to perform any data validation that your application shouldn’t handle.

Q: How would you create a ‘manual index’ in Riak? Doesn’t that need to always find unique keys?

One basic way to structure a manually-created index in Riak is to have a bucket specifically for the index. Keys in this bucket correspond to the exact value you are indexing (for fuzzy or incomplete values, use Riak Search). The objects stored at those keys have links or lists of keys that refer to the original object(s). Then you can find the original simply by following the link or using MapReduce to extract and find the related keys.

The example I gave in the webinar Q&A was indexing users by email. To create the index, I would use a bucket named users_by_email. If I wanted to look up my own user object by email, I’d try to fetch the object at users_by_email/sean@basho.com, then follow the link in it (something like </riak/users/237438-28374384-128>; riaktag="indexed") to find the actual data.

Whether those index values need to be unique is up to your application to design and enforce. For example, the index could be storing links to blog posts that have specific tags, in which case the index need not be unique.

To create the index, you’ll either have to perform multiple writes from your application (one for the data, one for the index), or add a commit hook to create and modify it for you.
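
The two-write approach can be sketched in a few lines of Python. This is only an illustration: the `store` dictionary and the `put`/`get` helpers are in-memory stand-ins for Riak, not a real client API, and the user fields are invented for the example.

```python
# In-memory stand-in for Riak: {bucket: {key: value}}. Not a real client.
store = {}


def put(bucket, key, value):
    store.setdefault(bucket, {})[key] = value


def get(bucket, key):
    return store.get(bucket, {}).get(key)


def create_user(user_id, user):
    # Write 1: the data itself.
    put("users", user_id, user)
    # Write 2: the index entry, keyed by the exact value being indexed.
    # Each link is (bucket, key, tag), mirroring a Riak link.
    put("users_by_email", user["email"],
        {"links": [("users", user_id, "indexed")]})


def find_user_by_email(email):
    entry = get("users_by_email", email)
    if entry is None:
        return None  # nothing indexed under this email
    bucket, key, _tag = entry["links"][0]
    return get(bucket, key)  # follow the link to the actual data


create_user("237438-28374384-128", {"name": "Sean", "email": "sean@basho.com"})
user = find_user_by_email("sean@basho.com")
```

Note that the two writes are not atomic, so the application has to decide what to do if the second one fails. A non-unique index (such as posts by tag) would append to the `links` list instead of replacing it.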

Q: Can you compare/contrast buckets w/ Cassandra column families?

Cassandra has a very different data model from Riak, and you’ll want to consult with their experts to get a second opinion, but here’s what I know. Column families are a way to group related columns that you will always want to retrieve together, and they are something that you design up-front (changes require restarting the cluster to take effect). It’s the closest thing to a relational table that Cassandra has.

Although you do use buckets to group similar data items, in contrast, Riak’s buckets:

  1. Don’t understand or enforce any internal structure of the values,
  2. Don’t need to be created or designed ahead of time, but pop into existence when you first use them, and
  3. Don’t require a restart to be used.

Q: How would part sharing be achieved? (this is a reference to the example given in the webinar, Radiant CMS)

Radiant shares content parts only when specified by the template language, and always by inheritance from ancestor pages. So if the layout contained <r:content part="sidebar" inherit="true" />, then if the currently rendering page doesn’t have that content part, it will look up the hierarchy until it finds it. This is one example of why it’s so important to have an efficient way to traverse the site hierarchy, and why I presented so many options.
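
The inheritance lookup described above amounts to walking parent pointers. Here is a minimal Python sketch of that logic; the `Page` class and `find_part` function are hypothetical names for illustration and are not Radiant’s actual implementation or a Riak data model.

```python
class Page:
    """Hypothetical page node: named content parts plus a parent pointer."""

    def __init__(self, title, parts=None, parent=None):
        self.title = title
        self.parts = parts or {}
        self.parent = parent


def find_part(page, name, inherit=False):
    """Return the named part, optionally searching ancestor pages."""
    current = page
    while current is not None:
        if name in current.parts:
            return current.parts[name]
        if not inherit:
            return None  # without inherit, only the current page is checked
        current = current.parent  # move one level up the hierarchy
    return None  # reached the root without finding the part


home = Page("Home", parts={"body": "Welcome", "sidebar": "Site-wide links"})
about = Page("About", parts={"body": "About us"}, parent=home)
team = Page("Team", parts={"body": "Our team"}, parent=about)

sidebar = find_part(team, "sidebar", inherit=True)
```

Each step up the hierarchy is another lookup, so the cost of rendering depends directly on how cheaply ancestors can be found.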

Q: What is the max number of links an object can have for Link Walking?

There’s no cut-and-dried answer for this. Theoretically, you are limited only by storage space (disk and RAM) and the ability to retrieve the object from the desired interface. In a practical sense this means that the default HTTP interface limits you to around 100,000 links on a single object (based on previous discussions of the limits of HTTP packets and header lengths). Still, that many links is not going to be reasonable to deal with in your application. In some applications we’ve seen links on the order of hundreds per object negatively impact link-walking performance. If you need to have that many, you’ll be better off exploring other designs.

Again, thanks for attending! Look for our next webinar coming in about a month.

Sean, Developer Advocate