November 29, 2011

It has been pointed out on occasion that Riak MapReduce isn’t real “MapReduce”, often with reference to Hadoop, which is. There are many times that Riak’s data processing pipeline is exactly what you want, but in case it isn’t, and you want to leverage existing Hadoop expertise and investment, you may now use Riak as an input/output, instead of HDFS.

This started off as a tinkering project, and it is currently released as riak-hadoop-0.2. I wouldn’t recommend it for production use today, but it is ready for exploratory work, whilst we work on some more serious integration for the future.

Input

Hadoop M/R usually gets its input from HDFS, and writes its results to HDFS. Riak-hadoop is a library that extends Hadoop’s InputFormat and OutputFormat classes so that a Riak cluster can stand in for HDFS in a Hadoop M/R job. The way this works is pretty simple. When defining your Hadoop job, you declare the InputFormat to be of type RiakInputFormat. You configure you cluster members and locations using the JobConf, and a helper class called RiakConfig. Your Mapper class must also extend RiakMapper, since there are some requirements for handling eventual consistency that you must satisfy. Apart from that, you code your Map method as if for a typical Hadoop M/R job.

Keys, for the splits

When Hadoop creates a Mapper task it assigns an InputSplit to that task. An input split is the subset of data that the Mapper will process. In Riak’s case this is a set of keys. But how do we get the keys to Map over? When you configure your job, you specify a KeyLister. You can use any input to Hadoop M/R that you would use for Riak M/R: provide a list of bucket/key pairs, a 2i query, a Riak Search query, or, ill advisedly, a bucket. The KeyLister will fetch the keys for the job and partition them into splits for the Mapper tasks. The Mapper tasks then access the data for the keys using a RiakRecordReader. The record reader is a thin wrapper around a Riak client, it fetches the data for the current key when the Hadoop framework asks.

Output

In order to output reduce results to Riak your Reducer only need implement the standard Reducer interface. When you configure the Job, just specify that you wish to use the RiakOutputFormat, and declare an output bucket as a target for results. The keys/values from your reduce will then be written to Riak as regular Riak objects. You can even specify secondary indexes, Riak metadata and Riak links on your output values, thanks to the Riak Java Client’s annotations and object mapping (courtesy of Jackson’s object mapper.)

Hybrid

Of course you don’t need to use Riak for both input and output. You could read from HDFS, process and store results in Riak, or read from Riak and store results in HDFS.

Why do this?

This is really a proof of concept integration. It should be of immediate use to anyone who already has Hadoop knowledge and a Hadoop cluster. If you’re a Riak user with no Hadoop requirements right now, I’d say, don’t go there at once: setting up a Hadoop cluster is way more complex than running Riak, and maintaining it is, operationally, taxing. If, however, you already have Hadoop, adding Riak as a data source and sink is incredibly easy, and gives you a great, scalable, live database for serving reads and taking writes, and you can leverage your existing Hadoop investment to aggregate that data.

What next?

The thinking reader might be saying “Huh? You stream the data in and out over the network, piecemeal?”. Yes, we do. Ideally we’d do a bulk, incremental replication between Riak and Hadoop (and back) and that is the plan for the next phase of work.

Summary

Riak-Hadoop enables Hadoop users to use a Riak cluster as a source and sink for Hadoop M/R jobs. This exposes he entire Hadoop toolset to Riak data (including the query languages like Hive and Pig!) This is only a first phase pass at the integration problem, and though usable today, smarter sync is coming.

Please clone, build, and play with this project. Have at it. There’s a follow up post with a look at an example Word Count Hadoop Map/Reduce job coming soon. If you can’t wait, just add a dependency on riak-hadoop, version 0.2 to your pom.xml and get started. Let me know how you get on, via the Riak mailing list.

Russell