Tag Archives: MapReduce

Webinar Recap – MapReduce Querying in Riak

July 7, 2010

Thank you to all who attended the webinar last Thursday; it was a great turnout with awesome engagement. As before, we’re recapping the questions below for everyone’s benefit (in no particular order).

Q: Say I want to perform two-fold link walking but don’t want to keep the “walk-through” results, including the initial one. Can I do something to keep only the last result?

In a MapReduce query, you can specify any number of phases to keep or ignore using the “keep” parameter on the phase. Usually you only want to keep the final phase. If you’re using the link-walker resource, it’ll return results from any phases whose specs end in “1”. See the REST API wiki page for more information on link-walking.
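
For example, a hypothetical job spec that walks two sets of links but keeps only the final phase’s results might look like this (bucket and key names are illustrative):

```javascript
{
  "inputs": [["users", "sean"]],
  "query": [
    {"link": {"bucket": "friends", "keep": false}},
    {"link": {"bucket": "friends", "keep": true}}
  ]
}
```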

Q: Will Riak Search work along with MapReduce, for example, to avoid queries over an entire bucket? Will there be a webinar about Riak Search?

Yes, we intend to have this feature in the Generally Available release of Riak Search. We will definitely have a webinar about Riak Search close to its public release.

Q: Are there still problems with executing “qfun” functions from Erlang during MapReduce?

“qfun” phases (which use anonymous Erlang functions) will work on a one-node cluster, but not across a multi-node cluster. You can use them in development, but it’s best to switch to a compiled module function or a Javascript function when moving to production.

Q: Although streams weren’t mentioned, do you have any recommendations on when to use streaming map/reduce versus normal map/reduce?

Streaming MapReduce sends results back as they get produced from the last phase, in a multipart/mixed format. To invoke this, add ?chunked=true to the URL when you submit the job. Streaming might be appropriate when you expect the result set to be very large and have constructed your application such that incomplete results are useful to it. For example, in an AJAX web application, it might make sense to send some results to the browser before the entire query is complete.

Q: Which way is faster: storing a lot of links or storing the target keys in the value as a list? Are there any limits to the maximum number of links on a key?

How the links are stored will likely not have a huge impact on performance; either approach would work. There are two relevant operations to consider with a key list document: updating and traversal.

The update process would involve retrieving the list, adding a value, and saving the list. If you are using the REST interface, be aware of Mochiweb’s limits on headers: at most 1000 headers are allowed, and each header is limited to 8192 characters. This imposes an upper limit on the number of links that can be set through the REST interface.

The best method for updating a key list would be to write a post-commit hook that performs the update. This avoids accessing the key list through the REST interface, so the header limitations are no longer a concern. However, the post-commit hook could become a bottleneck in your update path if the number of links grows large.

Traversal involves retrieving the key list document, collecting the related keys, and outputting a bucket/key list to be used in subsequent map phases. A built-in function is provided to process links. If you were to store keys in the value instead, you would need to write a custom function to parse the keys and generate a bucket/key list.
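
As a sketch, such a custom Javascript map function might look like the following, assuming the stored JSON keeps its keys under a hypothetical “task_keys” field:

```javascript
// Emits [bucket, key] pairs, which become the inputs to the next map phase.
// Assumes the stored document looks like {"task_keys": ["a", "b", ...]}.
function(value, keyData, arg) {
  var doc = JSON.parse(value.values[0].data);
  var inputs = [];
  for (var i = 0; i < doc.task_keys.length; i++) {
    inputs.push(["tasks", doc.task_keys[i]]);
  }
  return inputs;
}
```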

Q: What’s the benefit of passing an arg to a map or reduce phase? Couldn’t you just send the function body with the arg value filled in? Can I pass in a list of args or an arbitrary number of args?

When you have a lot of queries that are similar but with minor differences, you might be able to generalize a map or reduce function so that it can vary based on the ‘arg’ parameter. Then you could store that function in a built-ins library (see the question below) so it’s preloaded rather than evaluated at query-time. The arg parameter can be any valid JSON value.
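
As a minimal sketch, a generalized map function might filter objects on a field and value supplied through arg instead of hard-coding them (the field and value names here are made up for illustration):

```javascript
// Invoked with a phase spec like:
// {"map": {"language": "javascript", "source": "...",
//          "arg": {"field": "status", "value": "done"}, "keep": true}}
function(value, keyData, arg) {
  var doc = JSON.parse(value.values[0].data);
  return doc[arg.field] === arg.value ? [doc] : [];
}
```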

Q: What’s the behavior if the map function is missing from one or more nodes but present on others?

The entire query will fail. It’s best to make sure, perhaps via automated deployment, that all of your functions are available on all nodes. Alternatively, you can store Javascript functions directly in Riak and use them in a phase with “bucket” and “key” instead of “source” or “name”.
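
A phase that references a Javascript function stored in Riak itself might look like this sketch (the bucket and key are placeholders):

```javascript
{"map": {"language": "javascript", "bucket": "builtins", "key": "myMapFunction", "keep": true}}
```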

Q: If there are 2 map phases, for example, then does that mean that both phases will be run back to back on each individual node and *then* it’s all sent back for reduce? Or is there some back and forth between phases?

It’s more like a pipeline: one phase feeds the next. All results from one phase are sent back to the coordinating node, which then initiates the subsequent phase once all participating nodes have replied.

Q: Would it be possible to send a function which acts as both a map predicate and an updater?

In general we don’t recommend modifying objects as part of a MapReduce job because it can add latency to the request. However, you may be able to implement this with a map function in Erlang. Erlang MapReduce functions have full access to Riak including being able to read and write data.

```erlang
%% Inside your own Erlang module. predicate/1 and transform/1 are
%% placeholders for your application's logic.
map_predicate_with_update(Value, _KeyData, _Arg) ->
    case predicate(Value) of
        true -> [update_passed_value(Value)];
        _ -> []
    end.

update_passed_value(Value) ->
    {ok, C} = riak:local_client(),
    %% Modify the object's contents, then store it with the local client.
    ModifiedValue = riak_object:update_value(Value, transform(riak_object:get_value(Value))),
    ok = C:put(ModifiedValue, 1),
    ModifiedValue.
```

This could come in handy for large updates, saving you from having to pull each object, update it, and store it individually.

Q: Are Erlang named functions or JS named functions more performant? Which are faster — JS or Erlang functions?

Javascript functions incur a slight overhead because the Riak object must be encoded to JSON before the function is invoked; otherwise, the performance of the two is comparable.

Q: Is there a way to use namespacing to define named Javascript functions? In other words, if I had a bunch of app-specific functions, what’s the best way to handle that?

Yes, check out the built-in Javascript MapReduce functions for an example.
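
For instance, you might collect your app-specific functions under a single namespace object in a file loaded via js_source_dir, mirroring how the built-ins live under the Riak.* namespace (the names here are illustrative):

```javascript
// myapp_builtins.js -- loaded from js_source_dir at startup
var MyApp = MyApp || {};

// Referenced in a phase as:
// {"map": {"language": "javascript", "name": "MyApp.mapNames"}}
MyApp.mapNames = function(value, keyData, arg) {
  var doc = JSON.parse(value.values[0].data);
  return [doc.name];
};
```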

Q: Can you specify how data is distributed among the cluster?

In short, no. Riak consistently hashes keys to determine where in the cluster data is located. This article explains how data is replicated and distributed throughout the cluster. In most production situations, your data will be evenly distributed.

Q: What is the reason for the nested list of inputs to a MapReduce query?

The nested list lets you specify multiple keys as inputs to your query, rather than a single bucket name or key. From the Erlang client, inputs are expressed as lists of tuples (fixed-length arrays) which have length of 2 (for bucket/key) or 3 (bucket/key/key-specific-data). Since JSON has no tuple type, we have to express the inputs as arrays of length 2 or 3 within an array.
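
For example, an inputs list mixing plain bucket/key pairs with one carrying key-specific data might look like this sketch (all names illustrative):

```javascript
"inputs": [
  ["tasks", "task-1"],
  ["tasks", "task-2"],
  ["tasks", "task-3", {"priority": "high"}]
]
```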

Q: Is there a syntax requirement of JSON for Riak?

JSON is only required for the MapReduce query itself when submitted via HTTP; the objects you store can be in any format your application understands. JSON also happens to be a convenient format for MapReduce processing because it is accessible to both Erlang and Javascript. However, it is fairly common for Erlang-native applications to store data in Riak as serialized Erlang datatypes.

Q: Is there any significance to the name of file for how Riak finds the saved functions? I assume you can leave other languages in the same folder and it would be ignored as long as language is set to javascript? Additionally, is it possible/does it make sense to combine all your languages into a single folder?

Riak only looks for “*.js” files in the js_source_dir folder (see Configuration Files on the wiki). Erlang modules that contain map and reduce functions need to be on the code path, which could be completely separate from where the Javascript files are located.

Q: Would you point us to any best practices around matrix computations in Riak? I don’t see any references to matrix in the riak wiki…

We don’t have any specific support for matrix computations. We encourage you to find an appropriate Javascript or Erlang library to support your application.

Dan and Sean

Riak Fast Track Revisited

May 27, 2010

You may remember a few weeks back we posted a blog about a new feature on the Riak Wiki called The Riak Fast Track. To refresh your memory, “The Fast Track is a 30-45 minute interactive tutorial that introduces the basics of using and understanding Riak, from building a three node cluster up through MapReduce.”

This post is intended to offer some insight into what we learned from the launch and what we are aiming to do moving forward to build out the Fast Track and other similar resources.

The Numbers

The Fast Track and accompanying blog post were published on Tuesday, May 5th. After that there was a full week to send in thoughts, comments, and reviews. In that time period:

  • I received 24 responses (my hope was for >15)
  • Of those 24, 10 had never touched Riak before
  • Of those 24, 13 said they were already planning on using Riak in production or after going through the
    Fast Track were now intending to use Riak in production in some capacity

The Reviews

Most of the reviews seemed to follow a loose template: “Hey. Thanks for this! It’s a great tool and I learned a lot. That said, here is where I think you can improve…”

Putting aside the small flaws (grammar, spelling, content flow, etc.), numerous recurring topics emerged:

  • Siblings, Vector Clocks, Conflict Resolution, Concurrent Updates…More details please. How do they work in Riak and what implications do they have?
  • Source can be a pain. Can we get a tutorial that uses the binaries?
  • Curl is great, but can we get an Erlang/Protocol Buffers/language specific tutorial?
  • I’ve heard about Links in Riak but there is nothing in the Fast Track about them. What gives!?
  • Pictures, Graphics and Diagrams would be awesome. There is all this talk of Rings, Clusters, Nodes, Vnodes, Partitions, Vector Clocks, Consistent Hashing, etc. Some basic diagrams would go a long way in helping me grasp the Riak architecture.
  • Short, concise screencasts are awesome. More, please!
  • The Basic API page is great but it seems a bit…crowded. I know they are all necessary but do we really need all this info about query parameters, headers and the like in the tutorial?

Another observation about the nature of the reviews: they were very long and very detailed. It would appear that a lot of you spent considerable time crafting thoughtful responses and, while I was expecting this to some extent, I was still impressed and pleasantly surprised.

This led me to draw two conclusions:

  1. People were excited by the idea of bettering the Fast Track for future Riak users to come
  2. Swag is a powerful motivator

Now, I’m going to be a naïve Community Manager and let myself believe that the Riak Developer Community maintains a high level of programmer altruism. The swag was just an afterthought, right?

So What Did We Change?

We have been doing the majority of the editing and enhancing on the fly. This process is still ongoing and I don’t doubt that some of you will notice elements still present that you thought needed changing. We’ll get there. I promise.

Here is a partial list of what was revised:

  • The majority of changes were small and incremental, fixing a phrase here, tweaking a sentence there. Many small fixes and tweaks go a long way!
  • The most noticeable alterations are on the MapReduce page, which we reworked to flow better and be more interactive. This continues to be improved.
  • The Basic API Operations page got some love in the form of simplification. After reading your comments, we went back and realized that we were probably throwing too much information at you too fast.
  • There are now several graphics relating to the Riak Ring and Consistent Hashing. There will be more.

And, as I said, this is still ongoing.

Thank You!

I’ve added a Thank You page to the end of the Fast Track to serve as a permanent shout-out to those who help revise and refine the Fast Track. (I hope to see this list grow, too.) Future newcomers to Riak will surely benefit from your time, effort, and input.

What is Next?

Since its release, the Fast Track tutorial has become the second most-visited page on the Riak Wiki, behind only wiki.basho.com itself. This tells us here at Basho that there is a need for more tools and tutorials like this. So our intention is to expand it as far as time permits.

In the short term, we plan to add a link-walking page. This was scheduled for the original iteration of the Fast Track but was scrapped because we didn’t have time to assemble all the components. The MapReduce section is going to get more interactive, too.

Another addition will be content and graphics that demonstrate Riak’s fault-tolerance and ability to withstand node outages.

We also want to get more specific with languages. Right now the Fast Track uses curl over HTTP. This is great, but language-specific tutorials make tremendous sense, and the only thing preventing us from building them is time. The ultimate vision is to transform the Fast Track into a sort of “choose your own adventure” module, such that if a Ruby dev who prefers Debian shows up at wiki.basho.com without ever having heard of Riak, they can click a few links and arrive at a tutorial that shows them how to spin up three nodes of Riak on Debian and query them through Ripple. Erlang, Ruby, Javascript, and Java are at the top of the list.

But we have a long way to go before we get there, so stay tuned for continuous enhancements and improvements. And if you’re at all interested in helping develop and expand the Fast Track (say, perhaps, outlining an up-and-running tutorial for Riak+JavaScript), don’t hesitate to shoot an email to mark@basho.com.

Mark

Community Manager

Practical Map-Reduce – Forwarding and Collecting

This post is an example of how you can solve a practical querying problem in Riak with Map-Reduce.

The Problem

This query problem comes via Jakub Stastny, who is building a task/todolist app with Riak as the datastore. The question we want to answer is: for the logged-in user, find all of the tasks and their associated “tags”. The schema looks kind of like this:

Each of our domain concepts has its own bucket – users, tasks and tags. User objects have links to their tasks, tasks link to their tags, which also link back to the tasks. We’ll assume the data inside each object is JSON.

The Solution

We’re going to take advantage of these features of the map-reduce interface to make our query happen:

1. You can use link phases where you just need to follow links on an object.
2. Inputs to map phases can include arbitrary key-specific data.
3. You can have as many map, reduce, and link phases as you want in the same job.

Let’s construct the JSON job step-by-step, starting with the input – the user object.
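
Assuming the user object lives in the “users” bucket under an illustrative key, the inputs fragment is just:

```javascript
"inputs": [["users", "jakub"]]
```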

Next, we’ll use a link phase to find my tasks.
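
Since tasks live in the “tasks” bucket, a link phase along these lines will follow the user’s task links:

```javascript
{"link": {"bucket": "tasks", "keep": false}}
```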

Now that we’ve got all of my tasks, we’ll use a map function to extract the relevant data we need from each task, including the links to its tags, and pass it along to the next map phase as the keydata. Basically, it reads the task data as JSON, filters the object’s links down to those in the “tags” bucket, and then uses those links, combined with our custom data, to feed the next phase.
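
A sketch of that function, assuming each task’s JSON body carries a “name” field:

```javascript
function(value, keyData, arg) {
  var task = JSON.parse(value.values[0].data);
  var links = value.values[0].metadata.Links;  // [bucket, key, tag] triples
  var next = [];
  for (var i = 0; i < links.length; i++) {
    if (links[i][0] === "tags") {
      // Pass the task's data along as keydata, with an empty list
      // that the next phase will fill with tags.
      next.push([links[i][0], links[i][1], {"name": task.name, "tags": []}]);
    }
  }
  return next;
}
```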

Here’s the phase that uses that function:
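
A sketch, with the function body above abbreviated:

```javascript
{"map": {
  "language": "javascript",
  "source": "function(value, keyData, arg) { /* as above */ }",
  "keep": false
}}
```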

Now in the next map phase (which operates over the associated tags that we discovered in the last phase) we’ll insert the tag object’s parsed JSON contents into the “tags” list of the keydata object that was passed along from the previous phase. That modified object will become the input for our final reduce phase.

Here’s the phase specification for this phase (basically the same as the previous except for the function):
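
Again a sketch; the inlined function merges each tag’s parsed contents into the keydata passed along from the previous phase:

```javascript
{"map": {
  "language": "javascript",
  "source": "function(value, keyData, arg) { keyData.tags.push(JSON.parse(value.values[0].data)); return [keyData]; }",
  "keep": false
}}
```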

Finally, we have a reduce phase to collate the resulting objects with their included tags into single objects based on the task name.

Our final phase needs to return the results, so we add “keep”:true to the phase specification:
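
A sketch of the collation function and its phase spec:

```javascript
// Merges task objects that share a name, concatenating their tag lists.
function(values, arg) {
  var byName = {};
  for (var i = 0; i < values.length; i++) {
    var task = values[i];
    if (byName[task.name]) {
      byName[task.name].tags = byName[task.name].tags.concat(task.tags);
    } else {
      byName[task.name] = task;
    }
  }
  var results = [];
  for (var name in byName) {
    results.push(byName[name]);
  }
  return results;
}
```

With the function inlined, the phase spec looks like:

```javascript
{"reduce": {
  "language": "javascript",
  "source": "function(values, arg) { /* as above */ }",
  "keep": true
}}
```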

Here’s the final format of our Map/Reduce job, with indentation for clarity:
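
A sketch of the assembled job, with the function bodies abbreviated to the versions above:

```javascript
{
  "inputs": [["users", "jakub"]],
  "query": [
    {"link": {"bucket": "tasks", "keep": false}},
    {"map": {"language": "javascript", "source": "function(value, keyData, arg) {...}", "keep": false}},
    {"map": {"language": "javascript", "source": "function(value, keyData, arg) {...}", "keep": false}},
    {"reduce": {"language": "javascript", "source": "function(values, arg) {...}", "keep": true}}
  ]
}
```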

I input some sample data into my local Riak node, linked it up according to the schema described above, and ran the job; each task came back as a single object with its tags collected.

Conclusion

What I’ve shown you above is just a taste of what you can do with Map/Reduce in Riak. If the above query became common in your application, you would want to store those phase functions we created as built-ins and refer to them by name rather than by their source. Happy querying!

Sean

Calling all Rubyists – Ripple has Arrived!

February 11, 2010

The Basho dev team has been very excited about working with the Ruby community for some time. The only problem was that we were heads-down on so many other projects that it was hard to make any progress. But, even with all that work on our plates, we were committed to showing some love to Rubyists and their frameworks.

Enter Sean Cribbs. As Sean details in his latest blog post, Basho and the stellar team at Sonian made it possible for him to hack on Ripple, a one-of-a-kind client library and object mapper for Riak. The full feature set for Ripple can be found on Sean’s blog, but highlights include a DataMapper-like API, an easy builder-style interface to Map/Reduce, and near-term integration with Rails 3.

And, in case you need any convincing that you should consider Riak as the primary datastore for your next Rails app, check out Sean’s earlier post, “Why Riak should power your next Rails app.”

So, if you’ve read enough and want to get your hands on Ripple, go check it out on GitHub.

If you don’t have Riak downloaded and built yet, get on it.

Lastly, you are going to be seeing a lot more Riak in your Ruby. So stay tuned because we have some big plans.

Best,

Mark

The Release of Riak 0.8 and JavaScript Map/Reduce

February 3, 2010

We are happy to announce the release of Riak 0.8, available for download immediately. Riak 0.8 features a number of enhancements to the core map/reduce machinery that will make Riak accessible to a wider audience. The biggest enhancement is the ability to write map/reduce queries in JavaScript. We’re using our erlang_js project to integrate Mozilla’s Spidermonkey engine directly into Riak to keep overhead to a minimum.

We’ve also built a spiffy REST API for submitting map/reduce queries. Queries are described in JSON and POST-ed to the Riak server. Results are sent back as JSON for your processing pleasure. And, the REST interface supports streaming results for large result sets, too.

To kick it all off, we’ve put together a short screencast demonstrating how to use Riak’s flashy new features. You can watch it below, or view it on Vimeo. There’s also a slew of bug fixes and optimizations included in Riak 0.8. See the release notes for all the juicy details.

Download and enjoy!


Basho Podcast Number 1 – Justin Sheehy and Tony Falco on Scaling out with Riak and Riak Search

December 11, 2009

Just out: Basho’s first podcast discussing Riak. Justin Sheehy and Tony Falco revisit the definition of scalability Justin first discussed at NoSQL East 2009, discuss EC2, Riak, and Riak’s map/reduce and soon-to-be-released distributed search and indexing. As a special bonus, at 3:24 in the podcast, listen for the sound of Kevin Smith’s SMS accepting the job at Basho. The mic did not pick up Justin’s grimace. Of course, he didn’t miss a beat. “I just did, Bob….”

Enjoy,

Mark Phillips



Right-click here to download the podcast