Tag Archives: Basho

Ryan Zezeski Added As Community Wiki Committer

March 4, 2011

Anyone can contribute to the Riak Wiki: it’s maintained and deployed from a public GitHub repository, so everyone is free to fork and send us a pull request to make changes. There is, however, a group of community members who are given commit access to this repo, and I’m pleased to announce that Ryan Zezeski is now part of this group.

Ryan first became involved with Riak several months ago when he selected it as the production data store for a component of the ad-serving platform he works on during the daytime hours. Since then he has become an active and visible member of our community, contributing numerous patches to Luwak and providing guidance to new and existing users on the Riak mailing list and in the Riak IRC channel. In short, he knows his Riak, and we are thrilled to have him on board as a Community Committer.

Welcome, Ryan! We are looking forward to your contributions.

Mark

 

Mathias Meyer Has Joined The Basho Team

March 2, 2011

We are absolutely thrilled to announce that Mathias Meyer, known to some of you as Roidrage, has joined the team here at Basho as a Developer Advocate.

Mathias has dabbled with databases of many sorts over the years, and spent the last two years automating the heck out of cloud infrastructure at Scalarium, a company he co-founded and where he will continue to play an advisory role. Along the way he developed a certain fascination with distributed databases and a secret crush on Riak.

His spare time is currently devoted to writing the NoSQL Handbook, a project into which he is pouring his brains, soul, and an abundance of coffee. (On a related note, he has also agreed to take on the role of Coffee Advocate at Basho. Expect a webcast real soon.)

Mathias is based in Berlin and, as such, you can expect to see a lot of him at various events and conferences across Europe flying the Basho flag. His first stateside appearance as a member of the Basho team will be at JSConf, where he will be serving as the official conference photographer. (Basho also happens to be sponsoring both JSConf and NodeConf, by the way.)

You can find Mathias on GitHub as mattmatt and on Twitter as Roidrage.

Welcome, Mathias!

The Basho Team

Announcing KevBurnsJr as a PHP Client Committer

February 28, 2011

We just added Kevin Burns, who goes by KevBurnsJr on GitHub, on Twitter, and in the #riak IRC room on irc.freenode.net, as a committer to the Basho-supported Riak PHP Client.

Kevin has been hard at work over the past few weeks adding some great functionality to the PHP client and has even kicked off porting Ripple, Basho’s Ruby Client ODM, to PHP. Suffice it to say that we at Basho are excited about Kevin’s participation and involvement with Riak and our PHP code.

Some relevant code:

* Riak’s PHP Client
* Port of Ripple to PHP

Thanks, Kev! We are looking forward to your contributions.

Mark

Data Durability Is Not An After-market Add-on; Announcing KillDashNine

February 7, 2011

We started Basho Technologies with an idea as simple and timeless as an honest day’s work: databases shouldn’t lose data. Sounds radical, we know, but some people think losing data is fine as long as they have cool coffee mugs. (By their own admission, these folks have built "databases" that run the risk of losing data if users issue a simple "kill -9" command.)

We at Basho chose a different approach. We spent next to nothing on marketing for the last three years. Instead, we chose to invest our money and time developing a database that, among other things, offers the expected guarantees and safeguards typically associated with data storage technologies. Lose a machine, lose a rack, lose a data center — and your data is safe. Issue the "kill -9" command on a Riak node and you will see what we mean.

In celebration of our commitment to protecting data, we’ll be hosting something we are calling “KillDashNine” parties on the 9th of every month in various cities — wherever data loss is shrugged off and data durability is an after-market add-on.

The first KillDashNine party is happening this Wednesday, 2/9, in San Francisco. If you’re in the area and believe in classic drinks and out-of-the-box data durability, you should join us. This month’s featured drink is the "Dash Dur-Ty Martini", and anyone who is confident that issuing a "kill -9" on their running database won’t result in data loss gets a Dash Dur-Ty Martini on Basho.

(If you’re interested in helping get a KillDashNine event started in your area, email mark@basho.com.)

Tony

Creating a Local Riak Cluster with Vagrant and Chef

February 4, 2011

The “Riak Fast Track” has been around for at least nine months now, and lots of developers have gotten to know Riak that way, building their own local clusters from the Riak source. But there’s always been something that has bothered me about that process, namely, that the developer has to build Riak herself. Basho provides pre-built packages on downloads.basho.com for several Linux distributions, Solaris, and Mac OS X, but these have the limitation of only letting you run one node on a machine.

I’ve been a long-time fan of Chef, the systems and configuration management tool from Opscode, especially for its wealth of community recipes and vibrant participation. It’s also incredibly easy to get started with small Chef deployments using the Opscode Platform, which is free for up to five managed machines.

Anyway, as part of updating Riak’s Chef recipe last month to work with the 0.14.0 release, I discovered the easiest way to test the recipe — without incurring the costs of Amazon EC2 — was to deploy local virtual machines with Vagrant. So this blog post will be a tutorial on how to create your own local 3-node Riak cluster with Chef and Vagrant, suitable for doing the rest of the Fast Track.

Before we start, I’d like to thank Joshua Timberman and Seth Chisamore from Opscode who helped me immensely in preparing this.

Step 1: Install VirtualBox

Under the covers, Vagrant uses VirtualBox, a free virtualization product originally created at Sun. Go ahead and download and install the version appropriate for your platform from the VirtualBox website.

Step 2: Install Vagrant and Chef

Now that we have VirtualBox installed, let’s get Vagrant and Chef. You’ll need Ruby and Rubygems installed for this. Mac OS X comes with these pre-installed, but they’re easy to get on most platforms.
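
Both tools install as Ruby gems. A minimal sketch of that step, assuming RubyGems is already on your path (any version pinning from the era is omitted):

```bash
# Install Vagrant and the Chef client libraries as gems;
# depending on how Ruby is installed you may need sudo.
gem install vagrant
gem install chef
```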

Now that you’ve got them both installed, you need to get a virtual machine image to run Riak from. Luckily, Opscode has provided some images for us that have the 0.9.12 Chef gems preinstalled. Download the Ubuntu 10.04 image and add it to your local collection:
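
The command looks roughly like the following; the box name is arbitrary and the URL is a placeholder for wherever Opscode hosts the image:

```bash
# Register the Chef-enabled Ubuntu 10.04 image with Vagrant.
# Substitute the real download URL for the placeholder below.
vagrant box add base http://example.com/opscode-ubuntu-10.04.box
```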

Step 3: Configure Local Chef

Head on over to Opscode and sign up for a free Platform account if you haven’t already. This gives you access to the cookbooks site as well as the Chef admin UI. Make sure to collect your “knife config” and “validation key” from the “Organizations” page of the admin UI, and your personal “private key” from your profile page. These help you connect your local working space to the server.

Now let’s get our Chef workspace set up. You need a directory that has specific files and subdirectories in it, also known as a “Chef repository”. Again, Opscode has made this easy for us; we can just clone their skeleton repository:
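
Something along these lines, assuming you want the workspace in a directory named chef-repo:

```bash
# Clone Opscode's skeleton Chef repository to use as a local workspace
git clone git://github.com/opscode/chef-repo.git
cd chef-repo
```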

Now let’s put the canonical Opscode cookbooks (including the Riak one) in our repository:
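
One way to do this, assuming the riak and iptables cookbooks live in Opscode's community cookbooks repository, is to fetch that repository and copy the two we need into cookbooks/:

```bash
# Grab the Opscode cookbook collection and copy in the two cookbooks we use
git clone git://github.com/opscode/cookbooks.git /tmp/opscode-cookbooks
cp -R /tmp/opscode-cookbooks/riak cookbooks/
cp -R /tmp/opscode-cookbooks/iptables cookbooks/
```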

Finally, put the Platform credentials we downloaded above inside the repository (the .pem files will be named differently for you):
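
For example (the file names below are placeholders; yours will carry your organization and user names):

```bash
# knife looks for its configuration and keys in the repository's .chef directory
mkdir -p .chef
cp ~/Downloads/knife.rb .chef/
cp ~/Downloads/ORGNAME-validator.pem .chef/
cp ~/Downloads/USERNAME.pem .chef/
```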

Step 4: Configure Chef Server

Now we’re going to prep the Chef Server (provided by the Opscode Platform) to serve out the recipes needed by our local cluster nodes. The first step is to upload the two cookbooks we need using the *knife* command-line tool, shown in the snippet below. I’ve left out the output since it can get long.
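
The uploads themselves are one knife command per cookbook, run from the top of the Chef repository:

```bash
# Push the riak and iptables cookbooks to the Chef server (Opscode Platform)
knife cookbook upload riak
knife cookbook upload iptables
```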

Then we’ll create a “role” — essentially a collection of recipes and attributes — that will represent our local cluster nodes, and call it “riak-vagrant”. Using knife role create will open your configured EDITOR (mine happens to be emacs) with the JSON representation of the role. The role will be posted to the Chef server when you save and close your editor.
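
Creating the role is a single command:

```bash
# Opens $EDITOR with a JSON skeleton of the role; saving and quitting uploads it
knife role create riak-vagrant
```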

The key things to note about the role are the "run list" and the "override attributes" sections. The "run list" tells Chef which recipes to execute on a machine that receives the role. We configure iptables to run with Riak, and of course the relevant Riak recipes. The "override attributes" change default settings that come with the cookbooks. I’ve put explanations inline, but to summarize, we want to bind Riak to all network interfaces and put it in a cluster named "vagrant", which will be used by the "riak::autoconf" recipe to automatically join our nodes together.

Step 5: Setup Vagrant VM

Now that we’re ready on the Chef side of things, let’s get Vagrant going. Make three directories inside your Chef repository called dev1, dev2, and dev3, just like from the Fast Track. Change directory into dev1 and run vagrant init. This will create a Vagrantfile which you should edit to look like this one (explanations inline again):

Remember: change any place where it says ORGNAME to match your Opscode Platform organization.

Step 6: Start up dev1

Now we’re ready to see if all our preparation has paid off:
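
A sketch of that step, starting from the top of the Chef repository:

```bash
cd dev1
# Boot the VM; Vagrant then runs chef-client against the Opscode Platform,
# and a successful run ends with chef-client reporting the run is complete.
vagrant up
```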

If the Chef run at the end of the output completes successfully, it worked! If it doesn’t work the first time, try running vagrant provision from the command line to invoke Chef again. Let’s see if our Riak node is functional:
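
One quick check is to hit the HTTP interface through the forwarded port; a sketch, assuming dev1's Riak HTTP port is forwarded to 8091 on the host as in the table further down:

```bash
# Riak answers /ping with OK once the node is up
curl http://127.0.0.1:8091/ping
# /stats returns a JSON document of node and cluster statistics
curl http://127.0.0.1:8091/stats
```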

Awesome!

Step 7: Repeat with dev2, dev3

Now let’s get the other nodes set up. Since we’ve done the hard parts already, we just need to copy the Vagrantfile from dev1/ into the other two directories and modify them slightly.

The easiest way to describe the modifications is in a table:

| Line | dev2 | dev3 | Explanation |
| ---- | ---- | ---- | ----------- |
| 7 | "33.33.33.12" | "33.33.33.13" | Unique IP addresses |
| 11 (last number) | 8092 | 8093 | HTTP port forwarding |
| 12 (last number) | 8082 | 8083 | PBC port forwarding |
| 40 | "riak-fast-track-2" | "riak-fast-track-3" | Unique Chef node name |
| 48 | "riak@33.33.33.12" | "riak@33.33.33.13" | Unique Riak node name |

With those modified, start up dev2 (run vagrant up inside dev2/) and watch it connect to the cluster automatically. Then repeat with dev3 and enjoy your local Riak cluster!
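
To confirm that the nodes really did join, ask any node for its stats and look at the ring membership; a sketch, again assuming the 8091 port forward for dev1:

```bash
# ring_members should list all three riak@33.33.33.* node names once the cluster has formed
curl -s http://127.0.0.1:8091/stats | grep ring_members
```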

Conclusions

Beyond just being a demonstration of cool technology like Chef and Vagrant, you’ve now got a developer setup that is isolated and reproducible. If one of the VMs gets too messed up, you can easily recreate the whole cluster. It’s also easy to get new developers in your organization started using Riak since all they have to do is boot up some virtual machines that automatically configure themselves. This Chef configuration, slightly modified, could later be used to launch staging and production clusters on other hardware (including cloud providers). All in all, it’s a great tool to have in your toolbelt.

Sean

Fixing the Count

January 26, 2011

Many thanks to commenter Mike for taking up the challenge I offered in my last post. The flaw I was referring to was, indeed, the possibility that Luwak would split one of my records across two blocks.

I can check to see if Luwak has split any records with another simple map function:

```text
(riak@127.0.0.1)2> Fun = fun(L,O,_) ->
(riak@127.0.0.1)2> D = luwak_block:data(L),
(riak@127.0.0.1)2> S = re:run(D, "^([^\r]*)",
(riak@127.0.0.1)2> [{capture, all_but_first, binary}]),
(riak@127.0.0.1)2> P = re:run(D, "\n([^\r]*)$",
(riak@127.0.0.1)2> [{capture, all_but_first, binary}]),
(riak@127.0.0.1)2> [{O, S, P}]
(riak@127.0.0.1)2> end.
```

This one will return a 3-element tuple consisting of the block offset, anything before the first carriage return, and anything after the last linefeed. Running that function via map/reduce on my data, I see that it’s not only possible for Luwak to split a record across a block boundary, it’s also extremely likely:

```text
(riak@127.0.0.1)3> {ok, R} = C:mapred({modfun, luwak_mr, file, <<"1950s">>},
(riak@127.0.0.1)3> [{map, {qfun, Fun}, none, true}]).

(riak@127.0.0.1)4> lists:keysort(1, R).
[{0,
  {match,[<<"BAL,A,Baltimore,Orioles">>]},
  {match,[<<"play,4,0,pignj101">>]}},
 {1000000,
  {match,[<<",00,,NP">>]},
  {match,[<<"play,3,1,math">>]}},
 {2000000,
  {match,[<<"e101,??,,S7/G">>]},
  {match,[<<"play,7,1,kue">>]}},
 {3000000,
  {match,[<<"nh101,??,,4/L">>]},
  {match,[<<"start,walll101,\"Lee Walls\",1,7,">>]}},
 ...snip...
```

There are play records at the ends of the first, second, and third blocks (as well as others that I cut off above). This means that Joe Pignatano, Eddie Mathews, and Harvey Kuenn are each missing a play in their batting average calculation, since my map function only gets to operate on the data in one block at a time.

Luckily, there are pretty well-known ways to fix this trouble. The rest of this post will describe two: chunk merging and fixed-length records.

Chunk Merging

If you’ve watched Guy Steele’s recent talk about parallel programming, or read through the example luwak_mr file luwak_mr_words.erl, you already know how chunk-merging works.

The basic idea behind chunk-merging is that a map function should return information about data that it didn’t know how to handle, as well as an answer for what it did know how to handle. A second processing step (a subsequent reduce function in this case) can then match up those bits of unhandled data from all of the different map evaluations, and get answers for them as well.

I’ve updated baseball.erl to do just this. The map function now uses regexes much like those earlier in this post to produce “suffix” and “prefix” results for unhandled data at the start and end of the block. The reduce function then combines these chunks and produces additional hit:at-bat results that can be summed with the normal map output.

For example, instead of the simple count tuple a map used to produce:

```erlang
[{5, 50}]
```

The function will now produce something like:

```erlang
[{5, 50},
 {suffix, 2000000, <<"e101,??,,S7/G">>},
 {prefix, 3000000, <<"play,7,1,kue">>}]
```

Fixed-length Records

Another way to deal with boundary-crossing records is to avoid them entirely. If every record is exactly the same length, then it’s possible to specify a block size that is an even multiple of the record length, such that record boundaries will align with block boundaries.

I’ve added baseball_flr.erl to the baseball project to demonstrate using fixed-length records. The two fields needed from the "play" record for the batting average calculation are the player’s Retrosheet ID (the third field in the CSV format) and the play description (the sixth CSV field). The player ID is easy to handle: it’s already a fixed length of eight characters. The play description is, unfortunately, variable in length.

I’ve elected to solve the variable-length field problem with the time-honored solution of choosing a fixed length larger than the largest variation I have on record, and padding all smaller values out to that length. In this case, 50 bytes will handle the play descriptions for the 1950s. Another option would have been to truncate all play descriptions to the first two bytes, since that’s all the batting average calculation needs.

So, the file contents are no longer:

```text
play,3,1,mathe101,??,,S7/G
play,7,1,kuenh101,??,,4/L
```

but are now:

```text
mathe101S7/G..............................................
kuenh1014/L...............................................
```

(though a zero is used instead of a '.' in the actual format, and there are also no line breaks).

Setting up the block size is done at load time in baseball_flr:load_events/1. The map function to calculate the batting average on this format has to change the way in which it extracts each record from the block, but the analysis of the play data remains the same, and there is no need to worry about partial records. The reduce function is exactly the same as it was before learning about chunks (though the chunk-handling version would also work; it just wouldn’t find any chunks to merge).

Using this method does require reloading the data to get it in the proper format in Riak, but this format can have benefits beyond alleviating the boundary problem. Most notably, analyzing fixed-length records is usually much faster than analyzing variable-length, comma-separated records, since the record-splitter doesn’t have to search for the end of a record — it knows exactly where to find each one in advance.

“Fixed”

Now that I have solutions to the boundary problems, I can correctly award Harvey Kuenn’s 1950s batting average as:

```text
(riak@127.0.0.1)8> baseball:batting_average(<<"1950s">>, <<"kuenh101">>).
284
(riak@127.0.0.1)9> baseball_flr:batting_average(<<"1950s_flr">>, <<"kuenh101">>).
284
```

instead of the incorrect value given by the old, boundary-confused code:

```text
(riak@127.0.0.1)7> baseball:batting_average(<<"1950s">>, <<"kuenh101">>).
284
```

… wait. Did I forget to reload something? Maybe I better check the counts before division. New code:

```text
(riak@127.0.0.1)20> C:mapred({modfun, luwak_mr, file, <<"1950s_flr">>},
(riak@127.0.0.1)20> [{map, {modfun, baseball_flr, ba_map},
(riak@127.0.0.1)20> <<"kuenh101">>, false},
(riak@127.0.0.1)20> {reduce, {modfun, baseball_flr, ba_reduce},
(riak@127.0.0.1)20> none, true}]).
{ok,[{1231,4322}]}
```

old code:

```text
(riak@127.0.0.1)19> C:mapred({modfun, luwak_mr, file, <<"1950s">>},
(riak@127.0.0.1)19> [{map, {modfun, baseball, ba_map},
(riak@127.0.0.1)19> <<"kuenh101">>, false},
(riak@127.0.0.1)19> {reduce, {modfun, baseball, ba_reduce},
(riak@127.0.0.1)19> none, true}]).
{ok,[{1231,4321}]}
```

Aha: 1231 hits from both, but the new code found an extra at-bat — 4322 instead of 4321. The division says 0.28482 instead of 0.28488. I introduced more error by coding bad math (truncating instead of rounding) than I did by missing a record!

This result highlights a third method of dealing with record splits: ignore them. If the data you are combing through is statistically large, a single missing record will not change your answer significantly. If completely ignoring them makes you too squeamish, consider adding a simple "unknowns" counter to your calculation, so you can compute later how far off your answer might have been.

For example, instead of returning “suffix” and “prefix” information, I might have returned a simpler “unknown” count every time a block had a broken record at one of its ends (instead of a hit:at-bat tuple, a hit:at-bat:unknowns tuple). Summing these would have given me 47, if every boundary in my 48-block file broke a record. With that, I can say that if every one of those broken records was a hit for Harvey, then his batting average might have been as high as (1231+47)/(4321+47)=0.2926. Similarly, if every one of those broken records was a non-hit at-bat for Harvey, then his batting average might have been as low as 1231/(4321+47)=0.2818.

So, three options for you: recombine split records, avoid split records, or ignore split records. Do what your data needs. Happy map/reducing!

-Bryan

A Short Survey For Developers

December 29, 2010

The Dev team here at Basho is in the process of prioritizing some code and new feature development. So, we wanted your opinion on it. We threw together a short, simple survey to get some feedback on where we should be spending our time.

Whether you’re running Riak in production right now or only considering it for a future app, we want your feedback. It shouldn’t take you more than three minutes and it will greatly help us over the coming months.

Let us know if you have any questions, and thanks for participating.

The Basho Team

A Few Noteworthy Contributions to Riak

December 16, 2010

The community contributions to Riak have been increasing at an exciting rate. The pull requests are starting to roll in, and I wanted to take a moment and recognize several of the many contributions we’ve received over the past months and weeks (and days).

Ripple

Anyone who uses Riak with Ruby knows about Ripple. This is Basho’s canonical Ruby driver and its development has been spearheaded by Basho hacker Sean Cribbs. Not long after Sean started developing this code, he saw a significant influx of Rubyists who were interested in using Riak with Ruby and wanted to lend a hand in the driver’s development. Sean was happy to oblige and, as a result, there are now 15 developers in addition to Sean who have contributed to Ripple in a significant way. Special recognition should also be given to Duff Omelia and Adam Hunter who have made significant contributions to the code and use it in production.

Riak-js

Francisco Treacy and the team at Widescript made it known many months ago that they were looking into Riak to power part of their application. They, along with several other community members, were experimenting with Riak and Node.js. There were a few Node clients for Riak, but they were primarily experimental and immature. Basho had plans to write one, but development time was stretched and a Node client was several months off.

So, they rolled their own. Francisco, along with Alexander Sicular, James Sadler, Jakub Stastny, and Rick Olson developed and released riak-js. Since its release, it has picked up a ton of users and is being used in applications all over the place. (We liked it so much we even decided to build an app on it… more on this later.)

Thanks, guys, for the node client and helping to kickstart the Riak+Node.js community.

Riak Support in Spring Data

VMware’s Spring Data project is an ambitious one, and it has huge implications for the proliferation of new database technologies in application stacks everywhere. VMware made it known that Riak was slated for integration, needing only someone to take the time to write the code to connect the two. Jon Brisbin took up the task and never looked back.

Jon’s Twitter stream is essentially a running narrative of how his work on Riak developed and, as you can see, it took about a month to build support for Riak into the Grails framework, the culmination of which was the 1.0.0.M1 release of Riak Support in Spring Data.

So, if you’re using Riak with Spring Data, you have Jon Brisbin to thank for the code that made it possible. Thanks, Jon.

Python Docs

I met Daniel Lindsley at StrangeLoop in October. Rusty Klophaus and I were helping him debug a somewhat punishing benchmarking test he was running against a three-node Riak cluster on his laptop (during a Cassandra talk) using Basho’s Python client. About a month later Daniel wrote a fantastic blog post called Getting Started With Riak & Python. Though his impressions of Riak were positive on the whole, one of the main points of pain for Daniel was that the Python library had poor documentation. At the time, this was true. Though the library was quite mature as far as functionality goes, the docs had been neglected. I got in touch with Daniel, thanked him for the post, and let him know we were working on the docs. He mentioned he would take a stab at updating the docs if he had a free moment. Shortly thereafter Daniel sent over a huge pull request. He rewrote all of the Python documentation! And it’s beautiful. Check them out here.

Thanks to Daniel and the rest of the team at Pragmatic Badger, we have robust Python documentation. Thanks for the contribution.

Want to contribute to Riak? There is still much code to be written and the Riak community is a great place to work and play. Download the code, join us on IRC, or take a look at our GitHub repos to get started.

Mark

Basho at Philadelphia ETE Conference April 10-11

April 9, 2012

The 7th annual Philadelphia ETE conference is taking place this week at the Sheraton Society Hill Hotel, located in Philadelphia’s beautiful historic district. The organizers have certainly not pulled any punches when putting together this year’s lineup of speakers. The conference has been sold out for weeks, so if you don’t already have your ticket you’re missing a good one.

If you will be in attendance, you won’t be able to miss Basho at this event. Not only have we kicked in to sponsor the screens for every presentation you see, but we will also have a booth set up on the floor, so be sure to stop by and grab some swag – slap a Basho sticker on your laptop and be the envy of everyone at your local cafe!

Ryan Zezeski, Mike Walsh and Tom Santero will be in attendance, representing Basho. Be sure to grab one of them in between sessions or during Happy Hour to say hi and talk Riak.

See you there!

Tom

Riak at PyConZA

**October 03, 2012**

[PyConZA](http://za.pycon.org/) starts this Thursday in Cape Town, South Africa, and Basho is proud to be [a sponsor](http://za.pycon.org/sponsors_basho.html). As our blurb on the PyConZA site states:

> Basho is proud to sponsor PyConZA because we believe in the power and future of the South African tech community. We look forward to the valuable, lasting technologies that will be coming out of Cape Town and the surrounding areas in the coming years.

You may remember that back in August we [put out the call](http://basho.com/blog/technical/2012/08/23/Riak-ambassadors-needed-for-PyCon-ZA/) for Riak Community Ambassadors to help us out with PyConZA. As hard as it is to miss out on a chance to go to The Mother City, it wasn’t feasible for us to make it with [RICON](http://ricon2012.com) happening next week. I’m happy to report that a few fans of durability have stepped forward as Ambassadors to make sure Riak is fully represented. If you’re lucky enough to be going to PyConZA (it’s sold out), be on the lookout for the following two characters:

### Joshua Maserow

### Mike Jones

In addition to taking in the talks, Mike and Joshua will be on hand to answer your Riak questions (or tell you where to look for the answers if they don’t have them). There will also be some production Riak users in the crowd, so you won’t have to look too far if you want to get up to speed on why you should be running Riak.

Enjoy PyConZA. And thanks again to Mike and Joshua for volunteering to represent Riak in our absence.

[Mark](http://twitter.com/pharkmillups)