Tag Archives: link walking

Links and Link Walking on the Riak Fast Track

September 3, 2010

Links and link walking are two very powerful concepts in Riak. If used appropriately within an application, they can help to add another level of relationships to your data that are typically not possible when using a key/value store as your primary storage engine. It’s for this reason that we thought new users should know how to use them.

Today we are adding a Links and Link Walking tutorial to the Riak Fast Track. In about 20 minutes, you will learn what links are and how to use them. And, to cap it all off, there is a 12 minute screencast put together by Sean Cribbs to give you a more in depth look at what links are all about.We think this is pretty cool stuff and we have a feeling you’ll enjoy it, too.

As usual, we love, want and need feedback, so send an email to mark@basho.com with any questions or comments you might have. (I’ve also been known to send t-shirts in exchange for thoughts on how to make Riak and our documentation better…)

Enjoy, and thanks for using Riak!

Mark

Schema Design in Riak – Relationships

March 25, 2010

In the previous installment we looked at how your reasons for picking Riak affect how your schema should be designed, and how you might go about structuring your data at the individual object level. In this post we’ll look at how to design relationships on top of Riak.

Relationships? I thought Riak was key-value.

An even mildly-complicated application is going to have more than one type of data to store and manipulate. Those data are not islands, but have relationships to one another that make your application and its domain more than just arbitrary lists of things.

Yes, at its core, Riak is a key-value store or distributed hash-table. Because key-value stores are not very sophisticated at modeling more complicated relationships, Riak adds the concept of links between objects that are qualified by “tags” and can be easily queried using “link-walking”.

Now, the knee-jerk reaction would be to start adding links to everything. I want to show you that the problem of modeling relationships is a little more nuanced than just linking everything together, and that there are many ways to express the same relationship — each having tradeoffs that you need to consider.

Key correspondence

The easiest way to establish a relationship is to have some correspondence between the keys of the items. This works well for one-to-one and some one-to-many relationships and is easy to understand.


In the simplest case, your related objects have the same key, but different buckets. Lookups on this type of relationship are really efficient, you just change the bucket name to find the other item. How is this useful? Why not just store them together? One of the objects may get updated or read more often than the other. Their data types might be incompatible (a user profile and its avatar, for example). In either case, you get the benefit of the separation and fast access without needing link-walking or map-reduce; however, you really can only model one-to-one relationships with this pattern.

For one-to-many types of relationships, you might prefix or otherwise derive the key of the dependent (many) side of the relationship with the key of the parent side. This could be done as part of the bucket name, or as a simple prefix to the key. There are a couple of important tradeoffs to consider here. If you choose the bucket route, the number of buckets might proliferate in proportion to your data quantity. If you choose to prefix the key, it will be easy to find the parent object, but may be more difficult to find the dependent objects. The same reasons as having equivalent keys apply here — tight cohesion between the objects but different access patterns or internal structure.

De-normalization / Composition

A core principle in relational schema design is factoring your relations so that they achieve certain “normal forms”, especially in one-to-many sorts of relationships. This means that if your domain concept “has” any number of something else, you’ll make a separate table for that thing and insert a foreign key that points back to the owner. De-normalizing (or composing) your data often makes sense, both for the sake of performance and for ease of modeling.

How does this work? Let’s say your relational database had tables for people and for addresses. A person may have any number of addresses for home, work, mailing, etc, which are related back to the person by way of foreign key. In Riak, you would give your person objects an “addresses” attribute, in which you would store a list or hash of their addresses. Because the addresses are completely dependent on the person, they can be a part of the person object. If addresses are frequently accessed at the same time as the person, this also results in fewer requests to the database.

Composition of related data is not always the best answer, even when a clear dependency exists; take for instance, the Twitter model. Active users can quickly accrue thousands of tweets, which need to be aggregated in different combinations across followers’ timelines. Although the tweet concept is dependent on the user, it has more conceptual weight than the user does and needs to stand by itself. Furthermore, performance would suffer if you had to pull all of a user’s tweets every time you wanted to see their profile data.

Good candidates for composition are domain concepts that are very dependent on their “owner” concept and are limited in number. Again, knowing the shape of your data and the access pattern are essential to making this decision.

Links

Links are by far the most flexible (and popular) means for modeling relationships in Riak, and it’s obvious to see why. They hold the promise of giving a loose graph-like shape to your relatively flat data and can cleanly represent any cardinality of relationship. Furthermore, link-walking is a really attractive way to quickly do queries that don’t need the full power of map-reduce (although Riak uses map-reduce behind the scenes to traverse the links). To establish a relationship, you simply add a link on the object to the other object.

Intrinsically, links have no notion of cardinality; establishing that is entirely up to your application. The primary difference is whether changing an association replaces or adds/removes links from the associated objects. Your application will also have to do some accounting about which objects are related to other objects, and establish links accordingly. Since links are uni-directional, stored on the source, and incoming links are not automatically detected, your application will need to add the reciprocal links when traversals in both directions are needed (resulting in multiple PUT operations). In some cases, especially in one-to-many relationships where the “many” side is not accessed independently, you might not need to establish the reciprocal link. Knowing how your data will be accessed by the application — both reads and writes — will help you decide.

Links have a few other limitations that you will need to consider. First, although the tag part of the link can technically be any Erlang term, using anything other than a binary string may make it difficult for HTTP-based clients to deal with them. Second, since links are stored directly with the object in its metadata, objects that have many links will be slower to load, store, and perform map-reduce queries over. In the HTTP/REST interface as well, there are practical limitations simply because of the method of transport. At the time of writing, mochiweb — the library that is the foundation of webmachine, Riak’s HTTP interface — uses an 8K buffer for incoming requests and limits the request to 1000 header fields (including repeated headers). This means that each Link: header you provide needs to be less than 8K in length, and assuming you use the typical headers when storing, you can have at most about 995 individual Link: headers. By the time you reach the approximately 150,000 links that that provides, you’ll probably want to consider other options anyway.

Hybrid solutions

At this point, you might be wondering how your data is going to fit any of these individual models. Luckily, Riak is flexible, so you can combine them to achieve a schema that best fits your need. Here’s a few possibilities.

Often, either the number of links on an object grows large or the need to update them independently of the source object arises. In our Twitter example, updating who you follow is a significantly different operation from updating your user profile, so it makes sense to store those separately, even though they are technically a relationship between two users. You might have the user profile object and list of followed users as key-correspondent objects, such as users/seancribbs and following/seancribbs (not taking into account your followers, of course).

In relational databases you typically use the concept of a “join table” to establish many-to-many relationships. The intermediary table holds foreign keys back to the associated objects, and each row represents one individual association, essentially an “adjacency list”. As your domain becomes more complex and nuanced, you might find that these relationships represented by join tables become domain concepts in their own right, with their own attributes. In Riak, you might initially establish many-to-many relationships as links on both sides. Similarly to the “join table” issue, the relationship in the middle might deserve an object of its own. Some examples that might warrant this design: qualified relationships (think “friends” on Facebook, or permissions in an ACL scheme), soft deletion, and history (tracking changes).

Key correspondence, composition and linking aren’t exclusive ways to think of relationships between data in your application, but tools to establish the semantics your domain requires. I’ve said it many times already, but carefully evaluate the shape of your data, the semantics you want to impose on it, and the operational profile of your application when choosing how you structure your data in Riak.

Sean Cribbs

Link Walking By Example

February 24, 2010

Riak has a notion of “links” as part of the metadata of its objects. We talk about traversing, or “walking”, links, but what do the queries for doing so actually look like?

Let’s put four objects in riak:

  1. hb/first will link to hb/second and hb/third
  2. hb/second will link to hb/fourth
  3. hb/third will also link to hb/fourth
  4. hb/fouth doesn’t link anywhere
$ curl -X PUT -H "content-type:text/plain" 
  -H "Link: </riak/hb/second>; riaktag="foo", </riak/hb/third>; riaktag="bar"" 
  http://localhost:8098/riak/hb/first --data "hello"

$ curl -X PUT -H "content-type: text/plain" 
  -H "Link:</riak/hb/fourth>; riaktag="foo"" 
  http://localhost:8098/riak/hb/second --data "the second"

$ curl -X PUT -H "content-type: text/plain" 
  -H "Link:</riak/hb/fourth>; riaktag="foo"" 
  http://localhost:8098/riak/hb/third --data "the third"

$ curl -X PUT -H "content-type: text/plain" 
  http://localhost:8098/riak/hb/fourth --data "the fourth"

Now, say we wanted to start at hb/first, and follow all of its outbound links. The easiest way to do this is with the link-walker URL syntax:

$ curl http://localhost:8098/riak/hb/first/_,_,_

The response will be a multipart/mixed body with two parts: the hb/second object in one, and the hb/third object in the other:

--N2gzGP3AY8wpwdQY0jio62L9nJm
Content-Type: multipart/mixed; boundary=3ai6VRl4aLli3dKw8tG9unUeznT

--3ai6VRl4aLli3dKw8tG9unUeznT
X-Riak-Vclock: a85hYGBgzGDKBVIsTKLLozOYEhnzWBn+H/h5hC8LAA==
Location: /riak/hb/third
Content-Type: text/plain
Link: </riak/hb>; rel="up", </riak/hb/fourth>; riaktag="foo"
Etag: 5Fs0VskZWx7Y25tf1oQsvS
Last-Modified: Wed, 24 Feb 2010 15:25:51 GMT

the third
--3ai6VRl4aLli3dKw8tG9unUeznT
X-Riak-Vclock: a85hYGBgzGDKBVIsLEHbN2YwJTLmsTLMPvDzCF8WAA==
Location: /riak/hb/second
Content-Type: text/plain
Link: </riak/hb>; rel="up", </riak/hb/fourth>; riaktag="foo"
Etag: 2ZKEJ2gaT57NT7xhLDPCQz
Last-Modified: Wed, 24 Feb 2010 15:24:11 GMT

the second
--3ai6VRl4aLli3dKw8tG9unUeznT--

--N2gzGP3AY8wpwdQY0jio62L9nJm--

It’s also possible to express the same query in map-reduce, directly:

$ curl -X POST -H "content-type:application/json" 
  http://localhost:8098/mapred --data @-
{"inputs":[["hb","first"]],"query":[{"link":{}},{"map":{"language":"javascript","source":"function(v)
{ return [v]; }"}}]}
^D

That’s the exact same query. The content type of the response is different. It’s now a JSON array with two elements: a JSON encoding of the hb/second object, and a JSON encoding of the hb/third object. (Pretty-printed here, for clarity.)

[
    {
        "bucket": "hb",
        "key": "second",
        "vclock": "a85hYGBgzGDKBVIsLEHbN2YwJTLmsTLMPvDzCF8WAA==",
        "values": [
            {
                "metadata": {
                    "Links": [
                        ["hb","fourth","foo"]
                    ],
                    "X-Riak-VTag": "2ZKEJ2gaT57NT7xhLDPCQz",
                    "content-type": "text/plain",
                    "X-Riak-Last-Modified": "Wed, 24 Feb 2010 15:24:11 GMT",
                    "X-Riak-Meta": []
                },
                "data": "the second"
            }
        ]
    },
    {
        "bucket": "hb",
        "key": "third",
        "vclock": "a85hYGBgzGDKBVIsTKLLozOYEhnzWBn+H/h5hC8LAA==",
        "values": [
            {
                "metadata": {
                    "Links": [
                        ["hb","fourth","foo"]
                    ],
                    "X-Riak-VTag": "5Fs0VskZWx7Y25tf1oQsvS",
                    "content-type": "text/plain",
                    "X-Riak-Last-Modified": "Wed, 24 Feb 2010 15:25:51 GMT",
                    "X-Riak-Meta": []
                },
                "data": "the third"
            }
        ]
    }
]

Another interesting query is “follow only links that are tagged foo.” For that, just add a tag field to the link phase spec:

$ curl -X POST -H "content-type:application/json" 
  http://localhost:8098/mapred --data @-
{"inputs":[["hb","first"]],"query":[{"link":{"tag":"foo"}},{"map":{"language":"javascript","source":"function(v)
{ return [v]; }"}}]}
^D

Here you should get a JSON array with one element: a JSON encoding of the hb/second object. The link to the hb/third object was tagged bar, so that link was not followed. The equivalent URL syntax is:

$ curl http://localhost:8098/riak/hb/first/_,foo,_

It’s also possible to filter links by bucket by adding a bucket field to the link phase spec, or by replacing the first underscore with a bucket name in the URL format. But, all of our example links point to the same bucket, so hb is the only interesting setting here.

Link phases may also be chained together (or put after other phases if those phases produce bucket/key lists). For example, we could follow the links all the way from hb/first to hb/fourth with:

$ curl -X POST -H "content-type:application/json" 
  http://localhost:8098/mapred --data @-
{"inputs":[["hb","first"]],"query":[{"link":{}},{"link":{}},{"map":{"language":"javascript","source":"function(v)
{ return [v]; }"}}]}
^D

(Notice the added link phase.) If you run that, you’ll find that you get two copies of the hb/fourth object in the response. This is because we didn’t bother uniquifying the results of the link extraction, and both hb/second and hb/third link to hb/fourth. A reduce phase is fairly easy to add:

$ curl -X POST -H "content-type:application/json" 
  http://localhost:8098/mapred --data @-
{"inputs":[["hb","first"]],"query":[{"link":{}},{"link":{}},{"reduce":{"language":"erlang","module":"riak_mapreduce","function":"reduce_set_union"}},{"map":{"language":"javascript","source":"function(v)
{ return [v]; }"}}]}
^D

The resource handling the URL link-walking format does just this:

$ curl http://localhost:8098/riak/hb/first/_,_,_/_,_,_

That should get you just one copy of the hb/fourth object.

So why choose either map/reduce or URL-syntax? The advantage of URL syntax is that if you’re starting from just one object, and just want to get the objects at the ends of the links, and you can handle multipart/mixed encoding, then URL syntax is much simpler and more compact. Map/reduce with link phases should be your choice if you want to start from multiple objects at once, or you want to get some processed or aggregated form of the objects, or you want the result to be JSON-encoded.

Riak version 0.8 note: In Riak 0.8, the format of the result of ‘link’ map/reduce phases was not able to be transformed into JSON. This meant both that it was not possible to put a Javascript reduce phase right after a link phase, and also that it was not possible to end an HTTP map/reduce query with a link phase. Those issues have been resolved in the tip of the source repository, and will be part of the 0.9 release.

-Bryan