Webinar Recap and Q&A – Schema Design for Riak

December 8, 2010

Thank you to all who attended the webinar yesterday. The turnout was great, and the questions at the end were also very thoughtful. Since I didn’t get to answer very many, I’ve reviewed all of the questions below, in no particular order.

Q: Can you touch on upcoming filtering of keys prior to map reduce? Will it essentially replace the need for one to explicitly name the bucket/key in a M/R job? Does it require a bucket list-keys operation?

Key filters, in the upcoming 0.14 release, will allow you to logically select a population of keys from a bucket before running them through MapReduce. This will be faster than a full-bucket map since it only loads the objects you’re really interested in (the ones that pass the filter). It’s a great way to make use of meaningful keys that have structure to them. So yes, it does require a list-keys operation, but it doesn’t replace the need to be explicit about which keys to select; there are still many useful queries that can be done when the keys are known ahead of time.
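To make the idea concrete, here is a rough sketch of what a key-filtered MapReduce job might look like when built in Python. The bucket name, the key layout ("customer-date") and the helper function are made up for illustration, and the `key_filters` input format reflects my understanding of the 0.14 job syntax, so check it against the release documentation before relying on it.

```python
import json


def build_filtered_job(bucket, key_filters, phases):
    """Build a MapReduce job spec whose inputs are one bucket narrowed by key filters."""
    return {
        "inputs": {"bucket": bucket, "key_filters": key_filters},
        "query": phases,
    }


# Hypothetical keys shaped like "<customer>-<date>", e.g. "basho-20101215".
# The filter chain splits each key on "-", takes the first token, and keeps
# only keys whose first token equals "basho".
job = build_filtered_job(
    bucket="invoices",  # assumed bucket name
    key_filters=[["tokenize", "-", 1], ["eq", "basho"]],
    phases=[{"map": {"language": "javascript", "name": "Riak.mapValuesJson"}}],
)

# JSON body you would submit to the MapReduce endpoint.
payload = json.dumps(job)
```

Only the objects whose keys pass the filter chain would be loaded for the map phase, which is where the savings over a full-bucket map come from.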

For more information on key filters, see Kevin’s presentation on the upcoming MapReduce enhancements.

Q: How can you validate that you’ve reached a good/valid KV model when migrating a relational model?

The best way is to try out some models. The thing about schema design for Riak that turns your process on its head is that you design for optimizing queries, not for optimizing the data model. If your queries are efficient (single-key lookup as much as possible), you’ve probably reached a good model, but also weigh things like payload size, cost of updating, and difficulty manipulating the data in your application. If your design makes it substantially harder to build your application than a relational design, Riak may not be the right fit.

Q: Are there any “gotchas” when thinking of a bucket as we are used to thinking of a table?

Like tables, buckets can be used to group similar data together. However, buckets don’t automatically enforce data structure (columns with specified types, referential integrity) like relational tables do; that part is still up to your application. You can, however, add precommit hooks to buckets to perform any data validation that your application shouldn’t handle.

Q: How would you create a ‘manual index’ in Riak? Doesn’t that need to always find unique keys?

One basic way to structure a manually-created index in Riak is to have a bucket specifically for the index. Keys in this bucket correspond to the exact value you are indexing (for fuzzy or incomplete values, use Riak Search). The objects stored at those keys have links or lists of keys that refer to the original object(s). Then you can find the original simply by following the link or using MapReduce to extract and find the related keys.

The example I gave in the webinar Q&A was indexing users by email. To create the index, I would use a bucket named users_by_email. If I wanted to look up my own user object by email, I’d try to fetch the object at users_by_email/sean@basho.com, then follow the link in it (something like </riak/users/237438-28374384-128>; riaktag="indexed") to find the actual data.
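If you follow the link yourself rather than asking Riak to walk it, you need to pull the bucket and key out of the Link header. The function below is a minimal sketch of that, assuming links have the `</prefix/bucket/key>; riaktag="..."` shape shown above; the function name is hypothetical and it ignores URL-encoded characters in keys.

```python
import re

# Matches one link of the form </riak/bucket/key>; riaktag="tag".
LINK_RE = re.compile(r'<([^>]+)>\s*;\s*riaktag="([^"]*)"')


def parse_riak_links(header_value):
    """Return (bucket, key, tag) tuples for each object link in a Link header."""
    links = []
    for path, tag in LINK_RE.findall(header_value):
        parts = path.strip("/").split("/")
        if len(parts) == 3:  # URL prefix, bucket, key
            _prefix, bucket, key = parts
            links.append((bucket, key, tag))
    return links


header = '</riak/users/237438-28374384-128>; riaktag="indexed"'
links = parse_riak_links(header)
# links == [("users", "237438-28374384-128", "indexed")]
```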

Whether those index values need to be unique is up to your application to design and enforce. For example, the index could be storing links to blog posts that have specific tags, in which case the index need not be unique.

To create the index, you’ll either have to perform multiple writes from your application (one for the data, one for the index), or add a commit hook to create and modify it for you.
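The multiple-writes approach can be sketched as follows. To keep the example runnable without a cluster, it uses a small in-memory class as a stand-in for a Riak client; `FakeStore`, `save_user` and `find_user_by_email` are hypothetical names, not part of any Riak library, and a real client would issue the same two writes over the network.

```python
class FakeStore:
    """In-memory stand-in for a Riak client: {bucket: {key: value}}."""

    def __init__(self):
        self.buckets = {}

    def put(self, bucket, key, value):
        self.buckets.setdefault(bucket, {})[key] = value

    def get(self, bucket, key):
        return self.buckets.get(bucket, {}).get(key)


def save_user(store, user_key, user):
    """Write the user, then write the index entry keyed by the exact email."""
    store.put("users", user_key, user)
    index_entry = {"links": [("users", user_key, "indexed")]}
    store.put("users_by_email", user["email"], index_entry)


def find_user_by_email(store, email):
    """Fetch the index entry, then follow its first link to the user object."""
    entry = store.get("users_by_email", email)
    if entry is None:
        return None
    bucket, key, _tag = entry["links"][0]
    return store.get(bucket, key)


store = FakeStore()
save_user(store, "237438-28374384-128", {"name": "Sean", "email": "sean@basho.com"})
user = find_user_by_email(store, "sean@basho.com")
```

Note that the two writes are independent: if the second one fails, the data exists without an index entry, so the application has to decide how to detect and repair that case.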

Q: Can you compare/contrast buckets w/ Cassandra column families?

Cassandra has a very different data model from Riak, and you’ll want to consult with their experts to get a second opinion, but here’s what I know. Column families are a way to group related columns that you will always want to retrieve together, and they are something that you design up-front (it requires restarting the cluster for changes to take effect). A column family is the closest thing to a relational table that Cassandra has.

Although you do use buckets to group similar data items, in contrast, Riak’s buckets:

  1. Don’t understand or enforce any internal structure of the values,
  2. Don’t need to be created or designed ahead of time, but pop into existence when you first use them, and
  3. Don’t require a restart to be used.

Q: How would part sharing be achieved? (this is a reference to the example given in the webinar, Radiant CMS)

Radiant shares content parts only when specified by the template language, and always by inheritance from ancestor pages. So if the layout contained <r:content part="sidebar" inherit="true" />, then if the currently rendering page doesn’t have that content part, it will look up the hierarchy until it finds it. This is one example of why it’s so important to have an efficient way to traverse the site hierarchy, and why I presented so many options.
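The inheritance lookup described above can be sketched in a few lines. This is not Radiant’s code; it assumes each page is stored as a hypothetical record with a `parent` key and a `parts` map, and it shows why each step up the hierarchy costs one more fetch.

```python
def find_part(pages, page_id, part_name, inherit=False):
    """Return the named part from the page or, if inheriting, its nearest ancestor."""
    current = pages.get(page_id)
    while current is not None:
        if part_name in current["parts"]:
            return current["parts"][part_name]
        if not inherit:
            return None
        # One more lookup per level: this is the traversal cost to design for.
        current = pages.get(current["parent"])
    return None


# Hypothetical three-level site: home -> blog -> post.
pages = {
    "home": {"parent": None, "parts": {"body": "Welcome", "sidebar": "Site links"}},
    "blog": {"parent": "home", "parts": {"body": "Blog index"}},
    "post": {"parent": "blog", "parts": {"body": "A post"}},
}

sidebar = find_part(pages, "post", "sidebar", inherit=True)  # found on "home"
```

With `inherit=False` the same call returns `None`, because the post itself has no sidebar part.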

Q: What is the max number of links an object can have for Link Walking?

There’s no cut-and-dried answer for this. Theoretically, you are limited only by storage space (disk and RAM) and the ability to retrieve the object from the desired interface. In a practical sense this means that the default HTTP interface limits you to around 100,000 links on a single object (based on previous discussions of the limits of HTTP packets and header lengths). Still, this is not going to be reasonable to deal with in your application. In some applications we’ve seen links on the order of hundreds per object negatively impact link-walking performance. If you need to have that many, you’ll be better off exploring other designs.

Again, thanks for attending! Look for our next webinar coming in about a month.

Sean, Developer Advocate

Free Webinar – Schema Design for Riak – Dec 7 at 2PM Eastern

December 1, 2010

Moving applications to Riak involves a number of changes from the status quo of RDBMS systems, one of which is taking greater control over your schema design. You’ll have questions like: How do you structure data when you don’t have tables and foreign keys? When should you denormalize, add links, or create MapReduce queries? Where will Riak be a natural fit and where will it be challenging?

We invite you to join us for a free webinar on Tuesday, December 7 at 2:00PM Eastern Time to talk about Schema Design for Riak. We’ll discuss:

  • Freeing yourself of the architectural constraints of the “relational” mindset
  • Gaining a fuller understanding of your existing schema and its queries
  • Strategies and patterns for structuring your data in Riak
  • Tradeoffs of various solutions

We’ll address the above topics and more as we design a new Riak-powered schema for a web application currently powered by MySQL. The presentation will last 30 to 45 minutes, with time for questions at the end.

If you missed the previous version of this webinar in July, here’s your chance to see it! We’ll also use a different example this time, so even if you attended last time, you’ll probably learn something new.

Fill in the form below if you want to get started building applications on top of Riak!

Sorry, registration is closed! Video of the presentation will be posted on Vimeo after the webinar has ended.

The Basho Team