Tag Archives: Basho

Ryan Zezeski Has Joined The Basho Team

April 13, 2011

We are pleased to announce that Ryan Zezeski has joined the Basho Team!

Ryan has been coding since he was 14 and was hooked after writing his first program in Visual Basic 3.0. (It was an add-on for AOL that would automatically block users who spammed the chat room.) He wrote his first line of Erlang in late June of 2010 on a plane ride to Mountain View, CA, and his fingers haven’t stopped since.

In the last six months Ryan has been hard at work putting various pieces of Basho software into production at AOL, including Rebar, Riak, Riak Search, Riak Core, and Webmachine, and he has sent us numerous high-quality patches for Luwak and Riak in the process. (It was his work on Luwak that initially caught our eye.) He’s planning to spend as much time as possible with Riak Core and expose it to the greater public through his working blog, “try try try.”

On a personal note, Ryan resides in downtown Baltimore. If you happen to see a guy in Federal Hill sporting a black T-shirt that says “Riak”, don’t be scared to say hello.

Ryan is on Twitter as @rzezeski and on GitHub as rzezeski.

Welcome, Ryan!

The Basho Team

Adam Hunter Has Joined The Basho Team

March 24, 2011

Adam Hunter is the newest Basho Developer Advocate!

Adam first got involved with Riak when he used it in conjunction with Ripple, Riak’s Ruby library, to build several applications at his previous position. (In addition to being a deeply-skilled Ruby developer, Adam has also spent some happy years writing PHP for fun and profit.) In the process, he started contributing patches and features to Ripple, and we liked his code and enthusiasm for the project so much that we extended him committer rights. Since then he has become an active and visible member of the Riak community, so we were quite pleased when he accepted the offer to come aboard.

Home base for Adam is Charlotte, North Carolina, so be sure to look him up if you’re in the area and are interested in getting an earful about Riak and distributed systems. You can also find him on Twitter and on GitHub as adamhunter.

Welcome, Adam!

The Basho Team

KillDashNine March Happening on Wednesday

March 5, 2011

In February we kicked off the KillDashNine drinkup. It was a huge success (turns out we aren’t the only ones who care about durability) and, as promised, we’ll be having another drinkup this month. On Wednesday, 3/9, we will be clinking glasses and sharing data loss horror stories at Bloodhound, located at 1145 Folsom Street here in San Francisco.

This month’s chosen cocktail is the *Data Eraser*, and it’s simple to make: 2 oz vodka, 2 oz coffee liqueur, 2 oz tonic, and a dash of bitter frustration, anguish, and confusion (which is more or less how one feels when one’s data just disappears). And if you can’t make it, be sure to pour yourself a Data Eraser on 3/9 to take part in the festivities from wherever you happen to find yourself (or run your own local KillDashNine, like Marten Gustafson did in Stockholm last month).

Registration details for the event are here, so be sure to RSVP if you’re planning to join us. In the meantime, spin up a few nodes of your favorite database and try your hand at terminating some processes with the help of our favorite command: kill -9.

Long Live Durability!



Ryan Zezeski Added As Community Wiki Committer

March 4, 2011

Anyone can contribute to the Riak Wiki: it’s maintained and deployed from a public GitHub repository, so everyone is free to fork and send us a pull request to make changes. There is, however, a group of community members who are given commit access to this repo, and I’m pleased to announce that Ryan Zezeski is now part of this group.

Ryan first became involved with Riak several months ago when he selected it as the production data store for a component of the ad-serving platform he works on during the daytime hours. Since then he has become an active and visible member of our community, contributing numerous patches to Luwak and providing guidance to new and existing users on the Riak mailing list and in the Riak IRC channel. In short, he knows his Riak, and we are thrilled to have him on board as a Community Committer.

Welcome, Ryan! We are looking forward to your contributions.



Mathias Meyer Has Joined The Basho Team

March 2, 2011

We are absolutely thrilled to announce that Mathias Meyer, known to some of you as Roidrage, has joined the team here at Basho as a Developer Advocate.

Mathias has dabbled with databases of many sorts over the years, and spent the last two years automating the heck out of cloud infrastructure at Scalarium, a company he co-founded where he will continue to play an advisory role. Along the way he developed a certain fascination towards distributed databases and a secret crush on Riak.

His spare time is currently devoted to writing the NoSQL Handbook, a project into which he is pouring his brains, soul, and an abundance of coffee. (On a related note, he has also agreed to take on the role of Coffee Advocate at Basho. Expect a webcast real soon.)

Mathias is based in Berlin and, as such, you can expect to see a lot of him at various events and conferences across Europe flying the Basho flag. His first stateside appearance as a member of the Basho team will be at JSConf, where he will be serving as the official conference photographer. (Basho also happens to be sponsoring both JSConf and NodeConf, by the way.)

You can find Mathias on GitHub as mattmatt and on Twitter as Roidrage.

Welcome, Mathias!

The Basho Team

Announcing KevBurnsJr as a PHP Client Committer

February 28, 2011

We just added Kevin Burns, who goes by KevBurnsJr on GitHub, Twitter, and in the #riak IRC room on irc.freenode.net, as a committer to the Basho-supported Riak PHP Client.

Kevin has been hard at work over the past few weeks adding some great functionality to the PHP client and has even kicked off porting Ripple, Basho’s Ruby Client ODM, to PHP. Suffice it to say that we at Basho are excited about Kevin’s participation and involvement with Riak and our PHP code.

Some relevant code:

* Riak’s PHP Client
* Port of Ripple to PHP

Thanks, Kev! We are looking forward to your contributions.


Data Durability Is Not An After-market Add-on; Announcing KillDashNine

February 7, 2011

We started Basho Technologies with an idea as simple and timeless as an honest day’s work: databases shouldn’t lose data. Sounds radical, we know, but some people think losing data is fine as long as they have cool coffee mugs. (By their own admission, these folks have built “databases” that run the risk of losing data if users issue a simple “kill -9” command.)

We at Basho chose a different approach. We spent next to nothing on marketing for the last three years. Instead, we chose to invest our money and time developing a database that, among other things, offers the expected guarantees and safeguards typically associated with data storage technologies. Lose a machine, lose a rack, lose a data center — and your data is safe. Issue the “kill -9” command on a Riak node and you will see what we mean.

In celebration of our commitment to protecting data, we’ll be hosting something we are calling “KillDashNine” parties on the 9th of every month in various cities — wherever data loss is shrugged off and data durability is an after-market add-on.

The first KillDashNine party is happening this Wednesday, 2/9, in San Francisco. If you’re in the area and believe in classic drinks and out-of-the-box data durability, you should join us. This month’s featured drink is the “Dash Dur-Ty Martini”, and anyone who is confident that issuing a “kill -9” on their running database won’t result in data loss gets a Dash Dur-Ty Martini on Basho.

(If you’re interested in helping get a KillDashNine event started in your area, email mark@basho.com.)


Creating a Local Riak Cluster with Vagrant and Chef

February 4, 2011

The “Riak Fast Track” has been around for at least nine months now, and lots of developers have gotten to know Riak that way, building their own local clusters from the Riak source. But there’s always been something that has bothered me about that process: namely, that the developer has to build Riak herself. Basho provides pre-built packages on downloads.basho.com for several Linux distributions, Solaris, and Mac OS X, but these have the limitation of only letting you run one node per machine.

I’ve been a long-time fan of Chef, the systems and configuration management tool by Opscode, especially for the wealth of community recipes and vibrant participation. It’s also incredibly easy to get started with small Chef deployments via Opscode’s Platform, which is free for up to 5 managed machines.

Anyway, as part of updating Riak’s Chef recipe last month to work with the 0.14.0 release, I discovered the easiest way to test the recipe — without incurring the costs of Amazon EC2 — was to deploy local virtual machines with Vagrant. So this blog post will be a tutorial on how to create your own local 3-node Riak cluster with Chef and Vagrant, suitable for doing the rest of the Fast Track.

Before we start, I’d like to thank Joshua Timberman and Seth Chisamore from Opscode who helped me immensely in preparing this.

Step 1: Install VirtualBox

Under the covers, Vagrant uses VirtualBox, which is a free virtualization product, originally created at Sun. Go ahead and download and install the version appropriate for your platform:

Step 2: Install Vagrant and Chef

Now that we have VirtualBox installed, let’s get Vagrant and Chef. You’ll need Ruby and RubyGems installed for this. Mac OS X comes with these pre-installed, but they’re easy to get on most platforms.

Now that you’ve got them both installed, you need to get a virtual machine image to run Riak from. Luckily, Opscode has provided some images for us that have the 0.9.12 Chef gems preinstalled. Download the Ubuntu 10.04 image and add it to your local collection:
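Something like the following should do it (the box URL below points at Opscode’s published image from early 2011 and may have moved since; treat it as an assumption):

```sh
# Install Vagrant and Chef from RubyGems
gem install vagrant chef

# Fetch Opscode's Ubuntu 10.04 box (Chef 0.9.12 gems preinstalled)
vagrant box add ubuntu10.04-gems \
  http://opscode-vagrant-boxes.s3.amazonaws.com/ubuntu10.04-gems.box
```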

Step 3: Configure Local Chef

Head on over to Opscode and sign up for a free Platform account if you haven’t already. This gives you access to the cookbooks site as well as the Chef admin UI. Make sure to collect your “knife config” and “validation key” from the “Organizations” page of the admin UI, and your personal “private key” from your profile page. These help you connect your local working space to the server.

Now let’s get our Chef workspace set up. You need a directory that has specific files and subdirectories in it, also known as a “Chef repository”. Again, Opscode has made this easy for us; we can just clone their skeleton repository:
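Assuming Git and the canonical opscode/chef-repo skeleton on GitHub:

```sh
git clone git://github.com/opscode/chef-repo.git
cd chef-repo
```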

Now let’s put the canonical Opscode cookbooks (including the Riak one) in our repository:
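One straightforward way, assuming the canonical opscode/cookbooks repository (the skeleton’s placeholder cookbooks directory gets replaced by the clone):

```sh
# From inside the chef-repo directory
rm -rf cookbooks
git clone git://github.com/opscode/cookbooks.git
```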

Finally, put the Platform credentials we downloaded above inside the repository (the .pem files will be named differently for you):
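For example (the filenames here are placeholders; substitute your own organization and user names):

```sh
# Knife looks for its config and keys in the .chef directory
mkdir -p .chef
cp ~/Downloads/knife.rb .chef/
cp ~/Downloads/ORGNAME-validator.pem .chef/
cp ~/Downloads/yourusername.pem .chef/
```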

Step 4: Configure Chef Server

Now we’re going to prep the Chef Server (provided by Opscode Platform) to serve out the recipes needed by our local cluster nodes. The first step is to upload the two cookbooks we need using the *knife* command-line tool, shown in the snippet below the next paragraph. I’ve left out the output since it can get long.
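Assuming the two cookbooks are riak and iptables (the recipes used in the role), the uploads look like:

```sh
knife cookbook upload iptables
knife cookbook upload riak
```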

Then we’ll create a “role” — essentially a collection of recipes and attributes — that will represent our local cluster nodes, and call it “riak-vagrant”. Using knife role create will open your configured EDITOR (mine happens to be emacs) with the JSON representation of the role. The role will be posted to the Chef server when you save and close your editor.

The key things to note about what we’re editing in the role below are the “run list” and the “override attributes” sections. The “run list” tells what recipes to execute on a machine that receives the role. We configure iptables to run with Riak, and of course the relevant Riak recipes. The “override attributes” change default settings that come with the cookbooks. I’ve put explanations inline, but to summarize, we want to bind Riak to all network interfaces, and put it in a cluster named “vagrant” which will be used by the “riak::autoconf” recipe to automatically join our nodes together.
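Such a role might look roughly like the JSON below. The run list and the riak::autoconf recipe come straight from the description above; the exact override-attribute paths inside the Riak cookbook are an assumption here, so check them against the cookbook you uploaded:

```json
{
  "name": "riak-vagrant",
  "chef_type": "role",
  "json_class": "Chef::Role",
  "description": "Local Riak cluster node managed by Vagrant",
  "run_list": [
    "recipe[iptables]",
    "recipe[riak]",
    "recipe[riak::autoconf]"
  ],
  "override_attributes": {
    "riak": {
      "core": { "web_ip": "0.0.0.0" },
      "kv": { "pb_ip": "0.0.0.0" },
      "cluster_name": "vagrant"
    }
  }
}
```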

Step 5: Setup Vagrant VM

Now that we’re ready on the Chef side of things, let’s get Vagrant going. Make three directories inside your Chef repository called dev1, dev2, and dev3, just like from the Fast Track. Change directory inside dev1 and run vagrant init. This will create a Vagrantfile, which you should edit to look like this one (explanations inline again):
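A sketch of what the dev1 Vagrantfile might contain, using Vagrant 0.7-era syntax (the box name, host-only IP, forwarded host ports, and key paths are placeholders; adjust them to your setup):

```ruby
Vagrant::Config.run do |config|
  # Opscode's Ubuntu 10.04 box with the Chef gems preinstalled
  config.vm.box = "ubuntu10.04-gems"

  # Host-only IP so the cluster nodes can reach each other
  config.vm.network "33.33.33.11"

  # Forward Riak's HTTP (8098) and PBC (8087) ports to the host
  config.vm.forward_port "riak-http", 8098, 8091
  config.vm.forward_port "riak-pbc",  8087, 8081

  # Provision against the Opscode Platform Chef server
  config.vm.provision :chef_server do |chef|
    chef.chef_server_url        = "https://api.opscode.com/organizations/ORGNAME"
    chef.validation_key_path    = ".chef/ORGNAME-validator.pem"
    chef.validation_client_name = "ORGNAME-validator"
    chef.node_name              = "riak-fast-track-1"
    chef.add_role "riak-vagrant"
  end
end
```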

Remember: change any place where it says ORGNAME to match your Opscode Platform organization.

Step 6: Start up dev1

Now we’re ready to see if all our preparation has paid off:
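That is, from inside the dev1 directory:

```sh
cd dev1
vagrant up
```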

If you see lines at the end of the output like the ones above, it worked! If it doesn’t work the first time, try running vagrant provision from the command line to invoke Chef again. Let’s see if our Riak node is functional:
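A quick check, assuming Riak’s HTTP port was forwarded to 8091 on the host as in the Vagrantfile:

```sh
# A healthy node answers "OK"
curl http://127.0.0.1:8091/ping

# Node and cluster details as JSON
curl http://127.0.0.1:8091/stats
```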


Step 7: Repeat with dev2, dev3

Now let’s get the other nodes set up. Since we’ve done the hard parts already, we just need to copy the Vagrantfile from dev1/ into the other two directories and modify them slightly.

The easiest way to describe the modifications is in a table:

| Line | dev2 | dev3 | Explanation |
|------|------|------|-------------|
| 7 | "" | "" | Unique IP addresses |
| 11 (last number) | 8092 | 8093 | HTTP port forwarding |
| 12 (last number) | 8082 | 8083 | PBC port forwarding |
| 40 | "riak-fast-track-2" | "riak-fast-track-3" | Unique Chef node name |
| 48 | "riak@" | "riak@" | Unique Riak node name |

With those modified, start up dev2 (run vagrant up inside dev2/) and watch it connect to the cluster automatically. Then repeat with dev3 and enjoy your local Riak cluster!


Beyond just being a demonstration of cool technology like Chef and Vagrant, you’ve now got a developer setup that is isolated and reproducible. If one of the VMs gets too messed up, you can easily recreate the whole cluster. It’s also easy to get new developers in your organization started using Riak since all they have to do is boot up some virtual machines that automatically configure themselves. This Chef configuration, slightly modified, could later be used to launch staging and production clusters on other hardware (including cloud providers). All in all, it’s a great tool to have in your toolbelt.


Fixing the Count

January 26, 2011

Many thanks to commenter Mike for taking up the challenge I offered in my last post. The flaw I was referring to was, indeed, the possibility that Luwak would split one of my records across two blocks.

I can check to see if Luwak has split any records with another simple map function:

(riak@> Fun = fun(L,O,_) ->
(riak@> D = luwak_block:data(L),
(riak@> S = re:run(D, "^([^\r]*)",
(riak@> [{capture, all_but_first, binary}]),
(riak@> P = re:run(D, "\n([^\r]*)$",
(riak@> [{capture, all_but_first, binary}]),
(riak@> [{O, S, P}]
(riak@> end.

This one will return a 3-element tuple consisting of the block offset, anything before the first carriage return, and anything after the last linefeed. Running that function via map/reduce on my data, I see that it’s not only possible for Luwak to split a record across a block boundary, it’s also extremely likely:

(riak@> {ok, R} = C:mapred({modfun, luwak_mr, file, <<"1950s">>},
(riak@> [{map, {qfun, Fun}, none, true}]).

(riak@> lists:keysort(1, R).
{match,[<<"start,walll101,\"Lee Walls\",1,7,">>]}},

There are play records at the ends of the first, second, and third blocks (as well as others that I cut off above). This means that Joe Pignatano, Eddie Mathews, and Harvey Kuenn are each missing a play in their batting average calculation, since my map function only gets to operate on the data in one block at a time.

Luckily, there are pretty well-known ways to fix this trouble. The rest of this post will describe two: chunk merging and fixed-length records.

Chunk Merging

If you’ve watched Guy Steele’s recent talk about parallel programming, or read through the example luwak_mr file luwak_mr_words.erl, you already know how chunk-merging works.

The basic idea behind chunk-merging is that a map function should return information about data that it didn’t know how to handle, as well as an answer for what it did know how to handle. A second processing step (a subsequent reduce function in this case) can then match up those bits of unhandled data from all of the different map evaluations, and get answers for them as well.

I’ve updated baseball.erl to do just this. The map function now uses regexes much like those earlier in this post to produce “suffix” and “prefix” results for unhandled data at the start and end of the block. The reduce function then combines these chunks and produces additional hit:at-bat results that can be summed with the normal map output.

For example, instead of the simple count tuple a map used to produce:

[{5, 50}]

The function will now produce something like:

[{5, 50},
{suffix, 2000000, <<"e101,??,,S7/G">>},
{prefix, 3000000, <<"play,7,1,kue">>}]
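Putting the merge step together, the reduce side might be sketched like this. This is illustrative Erlang only, not the actual baseball.erl code; analyze/1 stands in for the real hit/at-bat extraction:

```erlang
%% Sketch: join each block's trailing suffix with the following
%% block's leading prefix to rebuild records split across boundaries.
%% BlockSize is the Luwak block size.
merge_chunks(Values, BlockSize) ->
    Suffixes = [{O, S} || {suffix, O, S} <- Values],
    Prefixes = [{O, P} || {prefix, O, P} <- Values],
    [ analyze(<<S/binary, P/binary>>)
      || {O, S} <- Suffixes,
         {O2, P} <- Prefixes,
         O2 =:= O + BlockSize ].
```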

Fixed-length Records

Another way to deal with boundary-crossing records is to avoid them entirely. If every record is exactly the same length, then it’s possible to specify a block size that is an even multiple of the record length, such that record boundaries will align with block boundaries.

I’ve added baseball_flr.erl to the baseball project to demonstrate using fixed-length records. The two fields needed from the “play” record for the batting average calculation are the player’s Retrosheet ID (the third field in the CSV format) and the play description (the sixth CSV field). The player ID is easy to handle: it’s already a fixed length of eight characters. The play description is, unfortunately, variable in length.

I’ve elected to solve the variable-length field problem with the time-honored solution of choosing a fixed length larger than the largest variation I have on record, and padding all smaller values out to that length. In this case, 50 bytes will handle the play descriptions for the 1950s. Another option would have been to truncate all play descriptions to the first two bytes, since that’s all the batting average calculation needs.
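With every record padded to the same length (an 8-byte Retrosheet ID plus a 50-byte description in this sketch), splitting a block back into records becomes a simple binary comprehension. This is illustrative Erlang, not the actual baseball_flr.erl code:

```erlang
%% Sketch: 58-byte fixed-length records fall cleanly out of a block,
%% provided the Luwak block size is a multiple of 58.
records(Block) ->
    [{Id, Desc} || <<Id:8/binary, Desc:50/binary>> <= Block].
```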

So, the file contents are no longer:


but are now:


(though a zero is used instead of a ‘.’ in the actual format, and there are also no line breaks).

Setting up the block size is done at load time in baseball_flr:load_events/1. The map function to calculate the batting average on this format has to change the way in which it extracts each record from the block, but the analysis of the play data remains the same, and there is no need to worry about partial records. The reduce function is exactly the same as it was before learning about chunks (though the chunk-handling version would also work; it just wouldn’t find any chunks to merge).

Using this method does require reloading the data to get it in the proper format in Riak, but this format can have benefits beyond alleviating the boundary problem. Most notably, analyzing fixed-length records is usually much faster than analyzing variable-length, comma-separated records, since the record-splitter doesn’t have to search for the end of a record — it knows exactly where to find each one in advance.


Now that I have solutions to the boundary problems, I can correctly compute Harvey Kuenn’s 1950s batting average:

(riak@> baseball:batting_average(<<"1950s">>, <<"kuenh101">>).
(riak@> baseball_flr:batting_average(<<"1950s_flr">>, <<"kuenh101">>).

instead of the incorrect value given by the old, boundary-confused code:

(riak@> baseball:batting_average(<<"1950s">>, <<"kuenh101">>).

… wait. Did I forget to reload something? Maybe I better check the counts before division. New code:

(riak@> C:mapred({modfun, luwak_mr, file, <<"1950s_flr">>},
(riak@> [{map, {modfun, baseball_flr, ba_map},
(riak@> <<"kuenh101">>, false},
(riak@> {reduce, {modfun, baseball_flr, ba_reduce},
(riak@> none, true}]).

old code:

(riak@> C:mapred({modfun, luwak_mr, file, <<"1950s">>},
(riak@> [{map, {modfun, baseball, ba_map},
(riak@> <<"kuenh101">>, false},
(riak@> {reduce, {modfun, baseball, ba_reduce},
(riak@> none, true}]).

Aha: 1231 hits from both, but the new code found an extra at-bat — 4322 instead of 4321. The division says 0.28482 instead of 0.28488. I introduced more error by coding bad math (truncating instead of rounding) than I did by missing a record!

This result highlights a third method of dealing with record splits: ignore them. If the data you are combing through is statistically large, a single missing record will not change your answer significantly. If completely ignoring them makes you too squeamish, consider adding a simple “unknowns” counter to your calculation, so you can compute later how far off your answer might have been.

For example, instead of returning “suffix” and “prefix” information, I might have returned a simpler “unknown” count every time a block had a broken record at one of its ends (instead of a hit:at-bat tuple, a hit:at-bat:unknowns tuple). Summing these would have given me 47, if every boundary in my 48-block file broke a record. With that, I can say that if every one of those broken records was a hit for Harvey, then his batting average might have been as high as (1231+47)/(4321+47)=0.2926. Similarly, if every one of those broken records was a non-hit at-bat for Harvey, then his batting average might have been as low as 1231/(4321+47)=0.2818.

So, three options for you: recombine split records, avoid split records, or ignore split records. Do what your data needs. Happy map/reducing!


A Short Survey For Developers

December 29, 2010

The Dev team here at Basho is in the process of prioritizing some code and new feature development, and we want your opinion. We threw together a short, simple survey to get some feedback on where we should be spending our time.

Whether you’re running Riak in production right now or only considering it for a future app, we want your feedback. It shouldn’t take you more than three minutes and it will greatly help us over the coming months.

Let us know if you have any questions, and thanks for participating.

The Basho Team