
Riak In Production At Posterous

January 31, 2012

The videos from last month’s San Francisco Riak Meetup are online and ready for consumption. The first features Julio Capote giving a short overview of the work he and Posterous are doing with Riak as a post cache. The second, from Mark Phillips, is all about Riak Control, the new Riak admin tool that will be fully supported in the forthcoming Riak 1.1 release.

We also need to give a special thanks to Jana Boruta and the team at StackMob for furnishing the awesome venue.

Enjoy, and thanks for being a part of Riak.

The Basho Team

 

Riak In Production At Posterous

This talk runs about 11 minutes. In it, Julio details the importance of the post cache at Posterous, what their initial solution to the problem was, and how they went about selecting Riak over MongoDB, MySQL, and Redis.

Preview of Riak Control

This talk runs just under 30 minutes. Mark starts with a history of the Riak Admin UI, details Basho’s motivations for writing and open-sourcing Riak Control, and then gives a live demo of the tool and talks about future enhancements.

QuickChecking Poolboy for Fun and Profit

This post appeared originally on Andrew Thompson’s Blog.

January 27, 2012

TL;DR

  • Unit tests are great, but they can’t test everything
  • Code always has bugs
  • QuickCheck helps you generate test cases at a volume where writing unit tests by hand would be impractical
  • Negative testing is as important as positive testing (test the invalid inputs)
  • Automatically shrinking test cases to the minimal case is immensely helpful
  • If you write Erlang commercially, you should really consider property-based testing, because it will find bugs you’ll never be able to replicate otherwise

This week, the Basho engineering team flew out to Denver and spent a week at the Oxford Hotel. Also attending was John Hughes, the CEO of QuviQ, who spent the week teaching a bunch of us how to use his property-based software testing tool, QuickCheck.

Property-based testing, for those unfamiliar with the term, is where you define some ‘properties’ about your software and then QuickCheck tries to come up with some combination of steps/inputs that will break your software. Beyond that it will shrink the typically massive failing cases it finds down to the minimal combination needed to provoke the failure (typically a handful of steps). However, I’m not going to go into details on how QuickCheck works, just on the results it provided.
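
To make the idea concrete, here is the canonical toy example of a property (nothing to do with poolboy, just an illustration of the workflow): QuickCheck generates random lists, checks the property for each one, and shrinks any failing input down to a minimal counterexample.

%% A minimal, self-contained property using the eqc header that ships with
%% QuviQ QuickCheck. Module and property names are mine, for illustration only.
-module(reverse_eqc).
-include_lib("eqc/include/eqc.hrl").
-export([prop_reverse/0]).

%% Reversing a list twice should always give back the original list.
prop_reverse() ->
    ?FORALL(L, list(int()),
            lists:reverse(lists:reverse(L)) =:= L).

Run it from the Erlang shell with eqc:quickcheck(reverse_eqc:prop_reverse()) and QuickCheck will report how many random cases passed.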

After two days of working through the QuickCheck training material and the exercises, we were ready to start writing our own QuickCheck tests against some of Riak’s code. I chose to start with testing poolboy, the Erlang worker pool library Riak uses internally for some tasks.

Poolboy was actually third-party code written by devinus from #erlang on Freenode. I needed a worker pool implementation for riak_core, specifically for doing asynchronous folds in riak_kv (though it’s a general feature in riak_core). I didn’t feel like writing my own, so I looked around and settled on poolboy. I added a bunch of tests, fixed a couple of bugs, added a way to check out workers without blocking if none were available, and started using it.

Now, poolboy had 85% test coverage (and most of the remaining 15% was irrelevant boilerplate) when I started QuickChecking it, and I felt pretty happy with its solidity, so I didn’t expect to find many bugs, if any. I was very wrong.

So, my first step was to write a simple QuickCheck model for poolboy using eqc_statem, the QuickCheck helper for testing stateful code. The abstract model for poolboy’s internals is pretty simple: all we really need to keep track of is the pid of the pool, the minimum size of the pool, by how much it can ‘overflow’ with ephemeral workers, and the list of workers currently checked out. From those bits of data, we can model how poolboy should behave, and that model becomes the ‘property’ we test.

Initially, I only tested starting, stopping, doing a non-blocking checkout, and checking a worker back in. I omitted testing blocking checkouts since they’re a little harder to do. This initial property checked out fine; no bugs were found (except in the property).
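
For the curious, here is a stripped-down sketch of what such an eqc_statem model can look like. It is illustrative only: the record fields, helper functions, and generators are my own simplification of what is described above, not the actual poolboy_eqc module, and it covers only start/stop plus non-blocking checkout/checkin.

%% Sketch of an eqc_statem model for a poolboy-like pool (assumes QuviQ QuickCheck).
-module(pool_eqc_sketch).
-include_lib("eqc/include/eqc.hrl").
-include_lib("eqc/include/eqc_statem.hrl").
-compile(export_all).

%% Model state: the pool pid, its fixed size, its allowed overflow, and the
%% workers currently checked out.
-record(state, {pid, size = 0, max_overflow = 0, checked_out = []}).

initial_state() ->
    #state{}.

%% Before the pool exists, the only sensible command is to start it.
command(#state{pid = undefined}) ->
    {call, ?MODULE, start_pool, [choose(0, 2), choose(0, 2)]};
command(S = #state{pid = Pid}) ->
    oneof([{call, poolboy, checkout, [Pid, false]}] ++        %% non-blocking checkout
          [{call, poolboy, checkin, [Pid, elements(S#state.checked_out)]}
           || S#state.checked_out =/= []] ++
          [{call, ?MODULE, stop_pool, [Pid]}]).

precondition(S, {call, poolboy, checkin, [_, Worker]}) ->
    lists:member(Worker, S#state.checked_out);
precondition(S, {call, _, start_pool, _}) ->
    S#state.pid =:= undefined;
precondition(S, _Call) ->
    S#state.pid =/= undefined.

%% The property: a non-blocking checkout succeeds exactly while the pool
%% still has capacity (size + overflow); otherwise it returns 'full'.
postcondition(S, {call, poolboy, checkout, _}, Res) ->
    Capacity = S#state.size + S#state.max_overflow,
    case length(S#state.checked_out) < Capacity of
        true  -> is_pid(Res);
        false -> Res =:= full
    end;
postcondition(_S, _Call, _Res) ->
    true.

next_state(S, Pool, {call, ?MODULE, start_pool, [Size, Overflow]}) ->
    S#state{pid = Pool, size = Size, max_overflow = Overflow};
next_state(S = #state{checked_out = Out, size = Sz, max_overflow = Ov}, Worker,
           {call, poolboy, checkout, _}) when length(Out) < Sz + Ov ->
    S#state{checked_out = [Worker | Out]};
next_state(S, _Res, {call, poolboy, checkout, _}) ->
    S;  %% the checkout returned 'full', so nothing changes in the model
next_state(S, _Res, {call, poolboy, checkin, [_, Worker]}) ->
    S#state{checked_out = lists:delete(Worker, S#state.checked_out)};
next_state(_S, _Res, {call, ?MODULE, stop_pool, _}) ->
    #state{}.

prop_pool() ->
    ?FORALL(Cmds, commands(?MODULE),
            begin
                {H, S, Res} = run_commands(?MODULE, Cmds),
                pretty_commands(?MODULE, Cmds, {H, S, Res}, Res =:= ok)
            end).

%% Helpers; pool_eqc_worker is a stub gen_server you would supply yourself.
start_pool(Size, Overflow) ->
    {ok, Pid} = poolboy:start_link([{worker_module, pool_eqc_worker},
                                    {size, Size}, {max_overflow, Overflow}]),
    Pid.

stop_pool(Pid) ->
    poolboy:stop(Pid).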

Next I added blocking checkouts, and suddenly the property failed. The output is a little hard to read, but the steps are:

  1. Start poolboy with a size of 0 and an overflow of 1
  2. Do a non-blocking checkout, which succeeds
  3. Do a blocking checkout that fails (with a timeout)
  4. Check the worker obtained in step 2 back in
  5. Do another non-blocking checkout
The result of step 5 should be a worker, but we get ‘full’ instead.

It turns out checkouts have a bug when the timeout on a blocking checkout fires and a worker later becomes available. The caller of a blocking checkout is blocked by the FSM storing the ‘From’ argument in a queue and popping that queue whenever a worker becomes available. However, if the caller times out during the checkout, the ‘From’ is left in the queue, and the next worker checked in will be sent to a process that is no longer expecting it (and which might not even be alive). This means poolboy leaks workers in this case. I fixed this by keeping track of when each checkout request was made and what its timeout was, and discarding elements from the waiting queue that have expired.
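
The shape of that fix, roughly speaking (the names here are illustrative, not the exact poolboy code), is to record when each blocking checkout was queued along with its timeout, and to skip any waiter whose deadline has already passed before handing it a worker:

%% Decide whether a queued waiter is still worth serving. StartTime is the
%% os:timestamp() taken when the blocking checkout was queued; Timeout is what
%% the caller asked for, in milliseconds, or the atom 'infinity'.
wait_valid(_StartTime, infinity) ->
    true;
wait_valid(StartTime, Timeout) ->
    WaitedMs = timer:now_diff(os:timestamp(), StartTime) div 1000,
    WaitedMs < Timeout.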

After making this change, the counterexample QuickCheck found now passes. The next thing I decided to check was whether workers dying while they’re checked out are handled correctly. I added a ‘kill_worker’ command which randomly kills a checked-out worker. I ran this test with a lot of iterations and found a second counterexample. This is what happens this time:

  1. Start a pool with a size of 1 and overflow of 1
  2. Do 3 non-blocking checkouts; the first 2 succeed, the third rightfully fails
  3. Check both of the workers we successfully checked out back in
  4. Check a worker back out
  5. Kill it while it’s checked out
  6. Do 2 more checkouts; both should succeed, but instead the second one reports the pool is ‘full’

Clearly something is wrong. I actually re-ran this a bunch of times and found a bunch of similar counterexamples. I had a really hard time debugging this until John suggested looking at the pool’s internal state to see what it thought was going on. So, I added a ‘status’ call to poolboy that would report its internal state (ready, overflow or full) and the number of permanent and overflow workers. John also suggested I use a dynamic precondition, which allowed me to cross-check the model’s and the pool’s state before each step and exit() on any discrepancy. This led to me finding lots of places where poolboy’s internal state was wrong, mainly around when it changed between the three possible states.
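
The cross-check itself is tiny. Here is a sketch of the idea, attached to the model above; model_status/1 is a hypothetical helper that derives the expected status tuple from the model state:

%% Abort the run loudly the moment the model and the real pool disagree.
dynamic_precondition(#state{pid = undefined}, _Call) ->
    true;
dynamic_precondition(S = #state{pid = Pid}, _Call) ->
    Expected = model_status(S),      %% hypothetical: what the model thinks the pool looks like
    Actual   = poolboy:status(Pid),  %% the status call added to poolboy for this purpose
    case Expected =:= Actual of
        true  -> true;
        false -> exit({state_mismatch, Expected, Actual})
    end.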

With those issues fixed, I moved on to checking what happened if a worker died while it was checked in. I wrote a command that would check out a worker, check it back in, and then kill it. QuickCheck didn’t find any bugs initially, but then I remembered an issue where poolboy was using tons of RAM because it was keeping track of way too many process monitors. Whenever you check a worker out of poolboy, poolboy monitors the pid holding the worker so that if it dies, poolboy can also kill the worker and create some free space in the pool. So, I decided to add the number of monitors as one of the things cross-checked between what the model expected and what poolboy actually had.

The latest counterexample went like this:

  1. Pool size 2, no overflow
  2. Checkout a worker
  3. Kill an idle worker (check it out, check it back in and then kill it)
  4. Checkout a worker

The cross-check actually blew up right before step 4, saying poolboy wasn’t monitoring any processes, when clearly it should have been monitoring whoever had done the checkout in step 2. I looked at the code and found that when it got an EXIT message from a worker that wasn’t currently checked out, it set the list of monitors to the empty list, blowing away all tracking of who had what worker checked out. This was pretty serious, but not that hard to fix; the fix was simply to leave the list of monitors unchanged in that case instead of zeroing it out.

However, seeing that serious flaw made me wonder more about how poolboy handled unexpected EXITs in other cases, like an EXIT from a process that wasn’t a worker. This could happen if you linked to the poolboy process for some reason and then that process exited. You might even want to do this to make sure your code knew if the pool exited, but in Erlang links are bidirectional. So, I went ahead and wrote a command to generate some spurious EXIT messages for the pool. As was becoming normal, QuickCheck quickly found a counterexample:

  1. Pool size 1, no overflow
  2. Checkout a worker
  3. Send a spurious EXIT message
  4. Kill the worker we checked out
  5. Stop the pool

Right before step 5, the cross-check failed, telling me poolboy thought it had 2 workers available, not one. Clearly this was another bug, and sure enough, poolboy was assuming any EXIT message was from a worker and would start a new worker to replace the ‘dead’ one, actually growing the pool beyond its configured limits. So, I changed the code to ignore EXIT messages from non-worker pids, but to still handle the death of checked-in workers correctly.
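
In outline, the fix is just a membership check before reacting to the EXIT; something like the following sketch (poolboy is a gen_fsm, so the handler lives in handle_info/3, and the field and helper names here are made up):

%% Only treat an 'EXIT' as a dead worker if the pid really belongs to the pool;
%% otherwise leave the state alone instead of growing the pool.
handle_info({'EXIT', Pid, _Reason}, StateName, State = #state{workers = Workers}) ->
    case lists:member(Pid, Workers) of
        true  -> {next_state, StateName, replace_dead_worker(Pid, State)};
        false -> {next_state, StateName, State}
    end;
handle_info(_Info, StateName, State) ->
    {next_state, StateName, State}.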

After all the bugs around EXIT messages, I decided to randomly check in non-worker pids 10% of the time and see what happened. Again, poolboy wasn’t checking for this condition and strange things would happen to the internal state. The fix was very similar to the one for spurious EXIT messages.

Now, I was beginning to run out of ways to break poolboy. I looked at the test coverage and saw that certain code around blocking checkouts was being hit by the unit tests but not by QuickCheck. QuickCheck can run commands serially or in parallel, and I had only been running commands serially so far. So, I added a parallel property and tried to run it. It blew up, telling me dynamic preconditions aren’t allowed in parallel mode. John told me this was indeed the case, so I just commented the dynamic precondition out. We’d lose our cool cross-checking, but it could always be uncommented if needed.

With the parallel tests running, I started to get counterexamples like this:

Common prefix

  • Start pool with size of 1, no overflow

Process 1

  • Check out a worker

Process 2

  • Check out a worker

Now, the problem was that both checkouts would succeed. This looks clearly wrong, until you understand that process 1 might exit before process 2 does the checkout, in which case poolboy notices and frees up space in the pool, at which point process 2 can successfully and validly check out a worker. John again suggested a neat trick: add a final command to each branch that calls erlang:self() (which returns the current pid). I then modified the tracking of checked-out workers to include which process had done the checkout, so we knew which workers would be destroyed (and their slots in the pool freed) when one of the parallel branches exited. This worked great and I was able to hit the code paths that were unreachable from a purely serial test.
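
For reference, the parallel version of the property is only a few lines on top of the same model (again, a sketch):

%% Run generated commands from several processes at once; the test passes only
%% if the observed results can be explained by some sequential interleaving.
prop_pool_parallel() ->
    ?FORALL(Cmds, parallel_commands(?MODULE),
            begin
                {Seq, Par, Res} = run_parallel_commands(?MODULE, Cmds),
                pretty_commands(?MODULE, Cmds, {Seq, Par, Res}, Res =:= ok)
            end).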

However, no matter how many iterations I ran, I couldn’t get another valid counterexample (I ran into some races in the Erlang process registry, but those are well known and harmless). At this point, finally, we knew that, barring flaws in the model, poolboy was pretty sound, and this adventure came to an end.

Interestingly, at no point did any of the original unit tests fail. However, I omitted describing the many bugs I found in my own model and how I was using QuickCheck, since I can’t really remember any of them, and they don’t matter in the long run.

Finally, I’d like to thank John Hughes for the great instruction and for being patient and helpful in the face of the crazy things I ran into while developing and testing the QuickCheck property; Basho, for being so dedicated to software quality that they provide all of their engineers with this great tool and the training to use it correctly; and all the people who helped proofread this post.

If you have any feedback, you can email me at andrew AT hijacked.us. Also, if you think doing stuff like this is interesting, we’re hiring!

Andrew

Congratulations, Amazon!

January 18, 2012

Amazon’s unveiling earlier today of DynamoDB, its new datastore, is good for the NoSQL movement and great for customer choice.

We were early believers in the set of principles outlined in the original Amazon Dynamo paper that was published in 2007. In fact, we were so impressed that we built our entire company around it. We launched Riak as open source software in 2009, and this has created a community of thousands of loyal users. The result has been a distributed database that has been tested and proven to scale massively. Just ask Voxer: Walkie Talkie App Voxer Soars Past Ten Billion Operations Per Day Powered by Riak.

More importantly, the Amazon announcement speaks to the power of NoSQL. Almost five years after the release of the Dynamo paper, the world’s largest e-commerce company is doubling down on NoSQL and the role of key-value stores. This aligns with a trend that has accelerated over the past year. The advantages of a scalable database that can elastically grow and shrink, with high throughput at very low latency, and that never loses a write, are increasingly viewed as highly compelling. Not to mention the massive cost advantage (infrastructure and operational savings) versus last-generation architectures. Amazon has acted (again), and that further reiterates that others should too.

At Basho, we are in favor of anything that expands customer choice. A dynamo-as-a-service offered by Amazon on their ecosystem will appeal to some. For others, the benefits of a Dynamo-inspired product that can be deployed on other public clouds, behind-the-firewall, or not on the cloud at all, will be critical.

For cloud providers, we believe they will see the wisdom in Amazon’s decision to provide a fully managed, Dynamo-inspired, NoSQL product. Of course, we know where they can get one!

Some will choose DynamoDB in the Amazon cloud, some will choose other Dynamo-inspired databases such as Riak behind their own firewall, and some will choose Dynamo-inspired databases in clouds other than Amazon. It’s good to have choices.

Congratulations again to Amazon for providing leadership in cloud technology.

New Riak Handbook Available Now for Download


Former Basho Developer Advocate Mathias Meyer authors a comprehensive, hands-on guide to Riak.

CAMBRIDGE, MA – January 17, 2012 – Basho Technologies, the leader in highly-available, distributed data store technologies, today announced that former Basho developer advocate Mathias Meyer has completed Riak Handbook, a comprehensive, hands-on guide to Riak, Basho’s industry-leading, open source, distributed database.

Riak Handbook begins by exploring the driving forces behind Riak, including Amazon Dynamo, eventual consistency, and the CAP Theorem. Through a collection of examples and code, Riak Handbook walks through Riak’s many features in detail, including the following capabilities:

  • Store and retrieve data in Riak
  • Analyze data with MapReduce using JavaScript and Erlang
  • Build and search full-text indexes with Riak Search
  • Index and query data using secondary indexes
  • Model data for eventual consistency
  • Scale to multi-node clusters in less than five minutes
  • Operate Riak in production
  • Handle failures in your application

Mathias Meyer is an experienced software developer, consultant and coach from Berlin, Germany. He has worked with database technology leaders such as Sybase and Oracle. He entered into the world of NoSQL in 2008 and worked at Basho Technologies from 2010 to 2011.

“We are excited that Mathias took on the endeavor to build a comprehensive book all about Riak,” said John Hornbeck, Vice President of Client Services, Basho Technologies. “Our customers and community will benefit from having a single source that covers everything from setting up Riak, to scaling out quickly, to operating and maintaining Riak. We have already seen strong customer interest in Riak Handbook, including many seeking site licenses to outfit their entire teams.”

Riak Handbook is available for purchase at riakhandbook.com. Single editions are available at $29/download. Site licenses are available for organizations implementing Riak for only $249.

About Basho Technologies
Basho Technologies is the leader in highly-available, distributed data store technologies used to power scalable, data-intensive Web, mobile and e-commerce applications. Our flagship product, Riak, frees customer applications from the performance, scalability, and availability constraints of traditional databases while reducing overall storage and support costs by up to 80%. Basho customers, including fast-growing Web businesses and large Fortune 500 enterprises, use Riak to implement global session stores, to aggregate large amounts of data for logging, search, and analytics, and to manage, store and stream unstructured data.

Riak is available open source for download at basho.com/resources/downloads. Riak EnterpriseDS is available with advanced replication, services and 24/7 support. For more information visit basho.com or follow us on Twitter at www.twitter.com/basho.

Basho Technologies is based in Cambridge, MA, and maintains regional offices in San Francisco, CA and Reston, VA.

2011 Was A Huge Year. Onto 2012

December 28, 2011

Distilling a year’s worth of work, innovation, and growth into one blog post is a fool’s errand. But we wanted to give it a shot regardless. This post is long, but it’s well worth the read. Make it to the end and you’ll see why. If you get there and regret it, let me know. I’ll send you some stickers.

2011 – A Look Back

2011 started off big for Basho and Riak. The fruits of our engineering labor were revealed in the Riak 0.14 Release that was made official on January 5th. This was a momentous event for us, and in the release were various feature additions and enhancements, along with copious bug fixes and usability improvements.

Next, in February, came a $7.5 Million Round of funding from some new and existing investors; they believed (and still believe) in our vision and product, and this money was put to good use building out the Basho team and pushing Riak farther.

With fresh funding in our coffers, we kept our heads down and continued to hack and hustle through February and March, picking up production users and closing new deals. April brought new interest in Riak Core, the framework that forms the backbone of Riak’s distributed capabilities. Companies like Yahoo! and AOL began to build applications on it for various use cases, and we did our best to make the project more usable outside of Riak. (There is still much to do to make Core truly accessible to developers, and, time permitting, we hope to address this in 2012.)

May arrived and we ruffled a few feathers with a blog post about what we thought was a theme that needed addressing in the NoSQL space. Also in May, Basho Board Member Eric Brewer was recruited to help Google plan and execute their cloud vision, one of the many accomplishments various members of the Basho Team would notch this year.

Corporate developments took center stage in June. We opened a new office in San Francisco, a move precipitated by massive user and customer growth on the West Coast. BashoWest, as we call it, has since become a co-working space of sorts in addition to our West Coast HQ, and we’re continuing to expand our efforts to spread knowledge about distributed systems and sound computing practices to developers. Later on that month we announced additional funding and the addition of Don Rippert as Basho CEO.

To start off July, we made it known that support for Google’s LevelDB would be part of the next release, a move that would let users take better advantage of Riak’s pluggable storage capabilities. Lager, a new logging framework for Erlang/OTP, was also released and announced by Andrew Thompson. Writing and open-sourcing Lager was one of many steps we took in 2011 to address Riak’s (and Erlang’s) developer-friendliness. Client libraries were also on display in July. The Riak Java Client was given a makeover in response to user and customer demand, and Russell Brown and various community members continue to enhance the code. Ripple, Riak’s Ruby client, shared the spotlight. Sean Cribbs and his team of committers took over BashoWest for a week to hold the Ripple Hackathon, an event that contributed to what was a monumental year for Riak’s adoption in the Ruby community.

We had our heads down in August, steadily grinding, only to re-emerge in September with a string of announcements about the code and features we were polishing off for the upcoming 1.0 release. The long-awaited Secondary Indexing component of Riak was announced and chronicled by Rusty Klophaus; the work Bryan Fink was doing on Riak Pipe, our new MapReduce framework, was revealed in detail; and Joseph Blomstedt, whom we luckily snatched up after he released riak_zab, demonstrated the extensive work that he and the team had been doing to refine Riak’s Clustering Capabilities.

Then, to end the month, mere hours before September concluded, we released Riak 1.0. The culmination of years of hard work and innovation from Basho and our community, this release was the biggest in the history of Riak and it’ll be some time before any of us forget this day. Glance at the release notes to grasp the scope of this release if you’re not already running the code.

Onto October and November. The 111 Minna Gallery in downtown San Francisco played host to the Riak 1.0 Release party, an event attended by the entire Basho team along with almost 200 users, customers, and Basho supporters. Several weeks later we publicized the work we had been doing on the Riak-Hadoop Connector. In November it was also revealed that Basho’s Director of Engineering Dave “Dizzy” Smith had been named Erlang User of the Year for his work on Rebar. Also noteworthy from November: Scott Lystig Fritchie, another member of the Basho Engineering Team, shared some details on the work he and others were doing (and continue to do) to get DTrace added to Erlang.

Which brings us to December. Two big things happened this month, both on the 15th. First the Basho Developer Advocate Team released Riaknostic, a chunk of code, complete with beautiful documentation, intended to help eliminate operational issues before your Riak cluster goes live. Also on the 15th, Community Member Mathias Meyer released the Riak Handbook, a short yet near-comprehensive guide to using Riak, and the first extensive publication dedicated solely to Riak. (I’m told sales are booming.)

Community, Contributions, and Production Deployments

And even with all this, we are nothing without our community of users, customers, and contributors. As our COO Tony Falco has been known to say, we have a “community that sustains us with hard work and positivity.” Going into 2012, this could not be more true.

The number of contributors to the projects that compose and are connected to Riak grew in a massive way, and the THANKS file now contains 170 names, up from about 40 at the beginning of the year. To date, hundreds of organizations and companies have contributed to the codebase, including Comcast, Yammer, GitHub, Trifork, Rails Machine, DISQUS, Formspring, Simple, Clipboard, Boundary, The Fedora Project, SEOmoz, SpawnGrid, Spreedly, ShowYou, Apollo Group… The list goes on. On the individual level, a special thanks is also owed to Tuncer Ayaz, for his dedication to Riak and Rebar.

Client library work that helped drive grass-roots adoption was done by people like Francisco Treacy (riak-js); Greg Stein, Soren Hansen, Gilles Devaux, and Brett Hoerner (Riak’s Python client); and Jeremiah Peschka and OJ Reeves, who took it upon themselves to bring Riak to the .NET world. The Riak PHP Client was and continues to be refined by developers like KevBurnsJr, Jonathan Langevin, Mark Steele, and Eric Stevens.

We are immensely lucky, thankful, and grateful for these and future contributions, and we consider it a privilege to have you spend time working on and with Riak. Thank You!

Production deployment numbers also exploded, to the point where we are now comfortable saying there are more than 1000 Riak clusters either in production or that will be there very soon. Some of the noteworthy use cases:

  • Voxer relied on Riak when they needed to scale their backend to handle billions of daily requests on their way to becoming the number one social networking application on iOS.
  • The Danish Government turned to Riak when they needed a datastore that could be trusted with the prescription records of their entire citizenry.
  • Bump, the #7 free iPhone Application of all time, switched to Riak when they realized their existing infrastructure wasn’t sustainable.
  • Yammer, which counts 80% of the Fortune 100 as customers, selected Riak to provide notifications to its millions of users.
  • DotCloud chose Riak to scale critical components of their internal infrastructure.
  • ShowYou built out two Riak clusters to power their social video application and have developed a custom storage backend with integrated search and analytic capabilities.

These represent just a small portion of the hallmark deployments. We would need many more blog posts to provide details on all of them. Please add your use case details to the comments if you’re feeling compelled.

We also saw the appearance of a healthy dialog (the “good” and the “bad”) around what it takes to run Riak in production, driven by companies like Inaka Networks, The NetCircle, Production Scale/Solution Set, and Linkfluence sharing their stories. Riak isn’t perfect yet, and you’re driving us to make it better.

Another stat worth sharing: at least nine of the FORTUNE 100 have either deployed Riak or are committed to deploying it for services that generate revenue.

Onto 2012

And so, with this, we close out a momentous 2011 knowing full well that what we have planned for 2012 will make the accomplishments and growth we saw over the past 12 months pale in comparison. Are we “market leaders”? Hard to say. This is not a title we can bestow upon ourselves. But this past year’s successes, coupled with the code, partnerships, new hires, products, customer announcements, and initiatives we have in the pipeline for 2012, have us feeling very good about the coming year.

Thanks for being a part of Riak. Happy Holidays.

On behalf of the entire Basho Team and our Community,

Mark

Community Manager

Announcing Riaknostic, your Riak Doctor

December 15, 2011

As much as we love Riak and talk about how easy it is to put into production, the activity on the #riak IRC room, the mailing list, and customer support issues has a way of telling us there are still some issues to be ironed out. Sometimes, things just go wrong and you want to know what they are. It can be frustrating to figure those things out if you don’t know what to look for.

We have felt this pain, too, and wished for a way that our customers and community users could discover and remedy problems on their own, without having to resort to the mailing list or IRC chat. That’s why today, we’re happy to announce the first public release of Riaknostic.

NOTE: Riaknostic is very new and still under development. It will not be fully tested and integrated until the next release of Riak. While its diagnostics are low-impact, do take care when running it on your system.

What is Riaknostic?

Riaknostic is an Erlang script (escript) that runs a series of “diagnostics” or “checks”, inspecting your operating system and Riak installation for known potential problems and then printing suggestions for how to fix those problems. Riaknostic will NOT fix those problems for you; it’s only a diagnostic tool. Some of the things it checks are:

  • How much memory does the Riak process currently use?
  • Do Riak’s data directories have the correct permissions?
  • Did the Riak node crash in the past and leave a dump file?

Some of the checks require that the Riak node be running, but others simply work with the operating system. It has a whole bunch of other options and details that I won’t cover here, but you can read about them on the homepage or in the API documentation. Of course, we encourage you to read the source code and fork the project as well.
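
To give a flavour of what a diagnostic looks like, here is a hypothetical stand-alone check in the same spirit. The real Riaknostic checks are Erlang modules behind a common behaviour, so the module and function names below are illustrative only:

%% Warn if a previous crash left an erl_crash.dump behind in the Riak directory.
-module(check_dump_file).
-export([check/1]).

check(RiakBase) ->
    Dump = filename:join(RiakBase, "erl_crash.dump"),
    case filelib:is_regular(Dump) of
        true ->
            {warning, io_lib:format("Found crash dump ~s; the node crashed at "
                                    "some point. Inspect or remove it.", [Dump])};
        false ->
            ok
    end.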

As is typical of Basho software, Riaknostic is free, open-source, and licensed under the Apache 2.0 license. You can use Riaknostic today, but full integration will be shipped in the next release of Riak (v1.1).

Extending Riaknostic For Fun and Wearable-Profit

We’ve constructed Riaknostic in such a way that it’s fairly easy to incorporate new diagnostics. Our goal is to get it to the point where Riaknostic can diagnose the majority of errors and misconfigurations that might befall your Riak cluster before you’re in production. Basho will continue to add to the library, but we can’t make it totally robust without your help. So, we will happily distribute various pieces of SWAG (including but not limited to t-shirts and stickers) to anyone who spends time extending it. Keep in mind that you’ll have to write a touch of Erlang to make this happen, but, if you’re new to the language, this is probably one of the best ways to learn. Take a look at the existing modules for examples and inspiration.

For precise instructions on how to contribute to the library, have a look at the Riaknostic README on the GitHub repo. And if you have any issues with it, send a message along to the Riak Mailing List or join us in #riak on IRC.

Enjoy!

Sean, on behalf of all Developer Advocates

Riak and Hadoop Word Count Example

December 1, 2011

My last post was a brief description of the integration between Riak and Hadoop MapReduce. This, the follow-up post, is a bit more hands-on and walks you through an example riak-hadoop MapReduce job. If you want to have a go, you need to follow these steps. I won’t cover all the details of installing everything, but I’ll point you at resources to help.

Getting Set Up

First you need both a Riak and Hadoop install. I went with a local devrel Riak cluster and a local Hadoop install in pseudo distributed mode. (A singleton node of both Hadoop and Riak installed from a package will do, if you’re pushed for time.)

The example project makes use of Riak’s Secondary Indexes (“2i”) so you will need to switch the backend on Riak to use LevelDB. (NOTE: This is not a requirement for riak-hadoop, but this demo uses 2i, which does require it). So, for each Riak node, change the storage backend from bitcask to eleveldb by editing the app.config.

{riak_kv, [
    %% storage_backend specifies the Erlang module defining the storage
    %% mechanism that will be used on this node.
    {storage_backend, riak_kv_eleveldb_backend},
    %% ...and the rest
]},

Now start up your Riak cluster, and start up Hadoop:

for d in dev/dev*; do $d/bin/riak start; done
cd $HADOOP_HOME
bin/start-dfs.sh
bin/start-mapred.sh

Next you need to pull the riak-hadoop-wordcount sample project from GitHub. Checkout the project and build it:

git clone https://github.com/russelldb/riak-hadoop-wordcount
cd riak-hadoop-wordcount
mvn clean install

A Bit About Dependencies…

Warning: Hadoop has some class loading issues. There is no class namespace isolation. The Riak Java Client depends on Jackson for JSON handling, and so does Hadoop, but on different versions (naturally). When the Riak-Hadoop driver is finished it will come with a custom classloader. Until then, however, you’ll need to replace your Hadoop lib/jackson*.jar libraries with the ones in the lib folder of this repo on your JobTracker / NameNode only. On your task nodes, you need only remove the Jackson jars from your hadoop/lib directory, since the classes in the job jar are at least loaded. (There is an open bug about this in Hadoop’s JIRA. It has been open for 18 months, so I doubt it will be fixed anytime soon. I’m very sorry about this. I will address it in the next version of the driver.)

Loading Data

The repo includes a copy of Mark Twain’s Adventures Of Huckleberry Finn from Project Gutenberg, and a class that chunks the book into chapters and loads the chapters into Riak.

To load the data run the Bootstrap class. The easiest way is to have maven do it for you:

mvn exec:java -Dexec.mainClass="com.basho.riak.hadoop.Bootstrap" -Dexec.classpathScope=runtime

This will load the data (with an index on the Author field). Have a look at the Chapter class if you’d like to see how easy it is to store a model instance in Riak.

Bootstrap assumes that you are running a local devrel cluster. If your Riak install isn’t listening on the PB interface on 127.0.0.1 on port 8081 then you can specify the transport and address like this:

mvn exec:java -Dexec.mainClass="com.basho.riak.hadoop.Bootstrap" -Dexec.classpathScope=runtime -Dexec.args="[pb|http PB_HOST:PB_PORT|HTTP_URL]"

That should load one item per chapter into a bucket called wordcount. You can check it succeeded by running (being sure to adjust the URL based on your configuration):

curl http://127.0.0.1:8091/riak/wordcount?keys=stream

Package The Job Jar

If you’re running the devrel cluster locally you can just package up the jar now with:

mvn clean package

Otherwise, first edit the RiakLocations in the RiakWordCount class to point at your Riak cluster/node, e.g.

conf = RiakConfig.addLocation(conf, new RiakPBLocation("127.0.0.1", 8081));
conf = RiakConfig.addLocation(conf, new RiakPBLocation("127.0.0.1", 8082));

Then simply package the jar as before.

Run The Job

Now we’re finally ready to run the MapReduce job. Copy the jar from the previous step to your hadoop install directory and kick off the job.

cp target/riak-hadoop-wordcount-1.0-SNAPSHOT-job.jar $HADOOP_HOME
cd $HADOOP_HOME
bin/hadoop jar riak-hadoop-wordcount-1.0-SNAPSHOT-job.jar

And wait… Hadoop helpfully provides status messages as it distributes the code and orchestrates the MapReduce execution.

Inspect The Results

If all went well there will be a bucket in your Riak cluster named wordcount_out. You can confirm it is populated by listing its keys:

curl http://127.0.0.1:8091/riak/wordcount_out?keys=stream

Since the WordCountResult output class has RiakIndex annotations for both the count and word fields, you can perform 2i queries on your data. For example, this should give you an idea of the most common words in Huckleberry Finn:

curl 127.0.0.1:8091/buckets/wordcount_out/index/count_int/1000/3000

Or, if you wanted to know which "f" words Twain was partial to, run the following:

curl 127.0.0.1:8091/buckets/wordcount_out/index/word_bin/f/g

Summary

We just performed a full roundtrip MapReduce, starting with data stored in Riak, feeding it to Hadoop for the actual MapReduce processing, and then storing the results back into Riak. It was a trivial task with a small amount of data, but it illustrates the principle and the potential. Have a look at the RiakWordCount class. You can see that only a few lines of configuration and code are needed to perform a Hadoop MapReduce job with Riak data. Hopefully the riak-hadoop-wordcount repo can act as a template for some further exploration. If you have any trouble running this example, please let me know by raising a GitHub issue against the project or by jumping on the Riak Mailing List to ask some questions.

Enjoy.

Russell

Riak and Hadoop (Sitting in a tree)

November 29, 2011

It has been pointed out on occasion that Riak MapReduce isn’t real “MapReduce”, often with reference to Hadoop, which is. There are many times when Riak’s data processing pipeline is exactly what you want, but in case it isn’t, and you want to leverage existing Hadoop expertise and investment, you may now use Riak as a Hadoop input and output in place of HDFS.

This started off as a tinkering project, and it is currently released as riak-hadoop-0.2. I wouldn’t recommend it for production use today, but it is ready for exploratory work, whilst we work on some more serious integration for the future.

Input

Hadoop M/R usually gets its input from HDFS, and writes its results to HDFS. Riak-hadoop is a library that extends Hadoop’s InputFormat and OutputFormat classes so that a Riak cluster can stand in for HDFS in a Hadoop M/R job. The way this works is pretty simple. When defining your Hadoop job, you declare the InputFormat to be of type RiakInputFormat. You configure your cluster members and locations using the JobConf and a helper class called RiakConfig. Your Mapper class must also extend RiakMapper, since there are some requirements for handling eventual consistency that you must satisfy. Apart from that, you code your Map method as if for a typical Hadoop M/R job.

Keys, for the splits

When Hadoop creates a Mapper task it assigns an InputSplit to that task. An input split is the subset of data that the Mapper will process. In Riak’s case this is a set of keys. But how do we get the keys to map over? When you configure your job, you specify a KeyLister. You can use any input to Hadoop M/R that you would use for Riak M/R: provide a list of bucket/key pairs, a 2i query, a Riak Search query, or, ill-advisedly, a whole bucket. The KeyLister will fetch the keys for the job and partition them into splits for the Mapper tasks. The Mapper tasks then access the data for the keys using a RiakRecordReader. The record reader is a thin wrapper around a Riak client that fetches the data for the current key when the Hadoop framework asks.

Output

In order to output reduce results to Riak, your Reducer need only implement the standard Reducer interface. When you configure the job, just specify that you wish to use the RiakOutputFormat, and declare an output bucket as a target for results. The keys/values from your reduce will then be written to Riak as regular Riak objects. You can even specify secondary indexes, Riak metadata, and Riak links on your output values, thanks to the Riak Java Client’s annotations and object mapping (courtesy of Jackson’s object mapper).

Hybrid

Of course you don’t need to use Riak for both input and output. You could read from HDFS, process and store results in Riak, or read from Riak and store results in HDFS.

Why do this?

This is really a proof of concept integration. It should be of immediate use to anyone who already has Hadoop knowledge and a Hadoop cluster. If you’re a Riak user with no Hadoop requirements right now, I’d say, don’t go there at once: setting up a Hadoop cluster is way more complex than running Riak, and maintaining it is, operationally, taxing. If, however, you already have Hadoop, adding Riak as a data source and sink is incredibly easy, and gives you a great, scalable, live database for serving reads and taking writes, and you can leverage your existing Hadoop investment to aggregate that data.

What next?

The thinking reader might be saying “Huh? You stream the data in and out over the network, piecemeal?”. Yes, we do. Ideally we’d do a bulk, incremental replication between Riak and Hadoop (and back) and that is the plan for the next phase of work.

Summary

Riak-Hadoop enables Hadoop users to use a Riak cluster as a source and sink for Hadoop M/R jobs. This exposes the entire Hadoop toolset to Riak data (including query languages like Hive and Pig!). This is only a first-phase pass at the integration problem, and though it is usable today, smarter sync is coming.

Please clone, build, and play with this project. Have at it. There’s a follow up post with a look at an example Word Count Hadoop Map/Reduce job coming soon. If you can’t wait, just add a dependency on riak-hadoop, version 0.2 to your pom.xml and get started. Let me know how you get on, via the Riak mailing list.

Russell

Riak 1.0 Release Party Recap

November 17, 2011

The Riak 1.0 Release Party happened just over a week ago in San Francisco. It was an exceptional evening, and we were able to bring together the Basho Team and a huge number of local Riak community members to celebrate the release.

In addition to the excellent food, drinks, company, and conversation, we had two great talks. The first was delivered by Basho CTO Justin Sheehy, who spent about 20 minutes on the origins of Basho and Riak, and precisely how and why we got to Riak 1.0. After Justin concluded, Dave “Dizzy” Smith, Basho’s Director of Engineering, closed things out with some passionate words about where Riak and Basho are going and why he’s excited to be a part of it.

Most importantly, if you weren’t able to attend, we recorded the talks so no one would miss out on the action. They are well worth the 30 minutes, and at the end of them you can call yourself a “Riak Historian”. You can find the video below, along with some photos we took at the event.

Enjoy, and thanks for being a part of Riak.

The Basho Team

Riak 1.0 Is Officially Released!

September 30, 2011

We are absolutely thrilled to report that as of today, Riak 1.0 is officially released and ready for your production applications!

Riak 1.0 packages are available. Go download one. And then go read the release notes because they are extensive and full of useful information highlighting the great work Basho has done since the last release.

There is already a lot of literature out there on the release, so here are the essentials to get you started.

The High Level Awesome

For those of you who need a refresher on the release, this 1.0 Slide Deck will give you a quick overview of why you should be excited about it. The big-ticket features are as follows:

Secondary Indexing

In 1.0 we added the ability to build secondary indexes on your data stored in Riak. We developed this functionality because, quite frankly, people needed a more powerful way to query their data.

Riak Pipe And Revamped MapReduce

Riak’s MapReduce functionality isn’t anything new, but we did a lot of work in this release to make the system more robust, performant, and resistant to failures. Riak Pipe is the new underlying processing layer that powers MapReduce, and you’ll be seeing a lot of cool features and functionality made possible as a result of it in the near future.

Lager

Usability is a huge focus for us right now, and logging is something that’s less-than-simple to understand in Erlang applications. To that end, we wrote a new logging framework for Erlang/OTP called Lager that is shipping with 1.0 and drastically reduces the headaches traditionally associated with Erlang logging and debugging.

Search Integration

Riak Search has been a supported Basho product for several releases now, but until 1.0 you were required to build it as a separate package. In 1.0 we’ve merged the search functionality into Riak proper. Enabling it is a simple one-line change in a configuration file. Do this and you’ve got distributed, full-text search capabilities on top of Riak.
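
For the curious, the flip looks something like this in each node’s app.config (shown as a sketch; check the release notes and wiki for the authoritative setting):

%% In app.config: enable Riak Search on this node.
{riak_search, [
    {enabled, true}
]},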

Support for LevelDB

Riak provides for pluggable storage backends, and we are constantly trying to improve the options we offer to our users. Google released LevelDB some months back, and we started to investigate it as a possible addition to our suite of supported backends. After some rigorous testing, what we found is that LevelDB has some attractive functionality and performance characteristics compared to our existing offerings (mainly Innostore), and it ships in 1.0. Bitcask is still the default storage engine, but LevelDB, aside from being an alternative for key/value storage, is the backend behind the new Secondary Indexing functionality.

Cluster Membership

One of the most powerful components of Riak is riak_core, the distributed systems framework that, among many other things, enables Riak to scale horizontally. Riak’s scalability and operational simplicity are of paramount importance to us, and we are constantly looking to make this code and system even better. With that in mind, we did some major work in 1.0 to improve upon our cluster membership system, and we’re happy to report that it’s now more stable and scalable than ever.

And So Much More …

Riak 1.0 is a massive accomplishment, and the features and code listed above are just the beginning of what this release has to offer. Take some time to read the lengthy release notes and you’ll see what we mean.

These improvements are many months in the making, and the bug fixes, new features, and added functionality make Riak (in our humble opinion) the best open source database available today.

Thank You, Community!

We did our best to ensure that the community was as big a part of this release as possible, and there’s no way the code and features would be this rock-solid without your help. Thanks for your usage, support, testing, debugging, and help with spreading the word about Riak and 1.0.

And 1.0 is just the beginning. We’ll continue to refine and build Riak over the coming months, and we would love for you to be a part of it if you’re not already. Some ways to get involved:

Download Riak on Your Preferred Platform

Read the Riak Wiki

Watch Riak on GitHub

Thanks for being a part of Riak!