Tag Archives: Riak

Free Webinar – Riak Operations – April 14 @ 2PM Eastern

April 7, 2011

Riak, like any other database, is not a piece of your infrastructure to be taken lightly. There are a variety of operational concerns that must be taken into account to ensure both reliability and performance of your Riak cluster. Deploying to different types of environments means knowing the nuances of each environment and being able to adapt to its performance characteristics. This webinar will cover many of the best practices for maintaining your Riak cluster in a variety of environments, as well as monitoring your cluster and understanding how to diagnose performance issues.

We invite you to join us for a free webinar on Thursday, April 14 at 2:00PM Eastern Time (UTC-4) to talk about Riak Operations. In this webinar, we’ll discuss:

  • Basic Riak Operations (config files, logs, etc.)
  • Monitoring Riak
  • Backups
  • Understanding Riak Performance
  • Riak in Virtualized Environments

We’ll address the above topics as well as many others. The presentation will last 30 to 45 minutes, with time for questions at the end.

Registration is now closed. 

Ripple 0.9 Release

April 4, 2011

I’m proud to announce that version 0.9.0 of the Ripple family of gems (riak-client, ripple, riak-sessions) was released yesterday. This is a huge leap forward from the 0.8 series, the last release of which was in December. I’m going to highlight some of the best new features in this blog post.

HTTP+SSL Support

Adam Hunter did an awesome job implementing support for HTTPS, including navigating the various idiosyncrasies of HTTP client libraries. If you’ve got HTTPS turned on in Riak, or a reverse-proxy in front that provides SSL, it’s easy to set up.

```ruby
# Turn on SSL
client = Riak::Client.new(:ssl => true)

# Alternatively, provide the protocol
client = Riak::Client.new(:protocol => "https")

# Want to be a good SSL citizen? Use a client certificate.
# This can be used to authenticate clients automatically on the server-side.
client.ssl = {:pem_file => "/path/to/pemfile"}

# Use the CA chain for server verification
client.ssl = { :ca_path => "/path/to/ca_cert/dir" }

# All three of the above options will invoke "peer" verification.
# Use "none" verification only if you're lazy. This is the default
# if you don't specify a client certificate or CA.
client.ssl = { :verify_mode => "none" }
```

Adam also added HTTP Basic authentication for those who use it on their reverse-proxy servers. It can be set with the :basic_auth option/accessor as a string of "user:password", as shown below.
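A minimal sketch of that option (the credentials here are placeholders):

```ruby
# Assumes a reverse-proxy in front of Riak that enforces HTTP Basic auth.
client = Riak::Client.new(:basic_auth => "user:password")

# The accessor form works on an existing client, too.
client.basic_auth = "user:password"
```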

Protocol Buffers

Riak has had a Protocol Buffers-based client API for a long time, but the state of Protocol Buffers support in Ruby was very poor until recently. Thanks to Blake Mizerany’s “Beefcake” library, it was really simple to add support in a cross-platform way. While it’s not insanely faster, the decreased overhead of many operations can make a big difference in the long run. Check out these benchmarks (run on MRI 1.9.2, comparing against the Excon HTTP backend):

                               user     system      total        real
http  ping                 0.020000   0.010000   0.030000 (  0.084994)
pbc   ping                 0.000000   0.000000   0.000000 (  0.007313)
http  buckets              0.010000   0.000000   0.010000 (  0.894827)
pbc   buckets              0.000000   0.000000   0.000000 (  0.864926)
http  get_bucket           0.480000   0.020000   0.500000 (  1.075365)
pbc   get_bucket           0.170000   0.030000   0.200000 (  0.271493)
http  set_bucket           0.060000   0.000000   0.060000 (  0.660926)
pbc   set_bucket           0.030000   0.000000   0.030000 (  0.579500)
http  store_new            0.710000   0.040000   0.750000 (  2.443635)
pbc   store_new            0.630000   0.030000   0.660000 (  1.382278)
http  store_key            0.730000   0.040000   0.770000 (  2.779741)
pbc   store_key            0.580000   0.020000   0.600000 (  1.539332)
http  fetch_key            0.690000   0.030000   0.720000 (  2.014679)
pbc   fetch_key            0.410000   0.030000   0.440000 (  0.948865)
http  keys                 0.300000   0.090000   0.390000 ( 78.455719)
pbc   keys                 0.530000   0.020000   0.550000 (  0.828484)
http  key_stream           0.200000   0.010000   0.210000 (  0.689116)
pbc   key_stream           0.530000   0.010000   0.540000 (  0.833347)

Adding Protocol Buffers required a breaking change in the Riak::Client class, namely that the port setting/accessor was split into http_port and pb_port. If you still have this setting in a configuration file or your code, you will receive a deprecation warning.

```ruby
# Use Protocol Buffers! (default port is 8087)
client = Riak::Client.new(:protocol => "pbc", :pb_port => 8087)

# Use HTTP and Protocol Buffers in parallel, too! (Luwak and Search require HTTP)
client.store_file("bigpic.jpg", "image/jpeg", File.open("images/bigpic.jpg", 'rb'))
```

**Warning**: Because some operations (namely get_bucket_props and set_bucket_props) are not semantically equivalent on both interfaces, you might run into some unexpected problems. I have been assured that these differences will be fixed soon.

MapReduce Improvements

Streaming MapReduce is now supported on both protocols. This lets you handle results as they are produced by MapReduce rather than waiting for all the results to be accumulated. Unlike the traditional mode, you will be passed a phase number in addition to the data from the phase. Like key-streaming, just give a block to the run method.

```ruby
# Make a MapReduce job like usual.
Riak::MapReduce.new(client).
  add("people", "sean").
  link(:tag => "friend").
  map("Riak.mapValuesJson", :keep => true).
  run do |phase, data| # Streaming!
    puts data.inspect
  end
```

MapReduce key-filters, a new feature in Riak 0.14, were already available in the beta releases of 0.9. They let you reduce the number of keys fed to your MapReduce query based on criteria applied to the key names. Here’s an example:

```ruby
# Blockish builder syntax for key-filters
Riak::MapReduce.new(client).
  filter("posts") do
    tokenize "-", 1 # Split the key on dashes, take the first token
    string_to_int   # Convert the token to an integer
    eq 2011         # Only pass the ones from 2011
  end

# Same as above, without builder syntax
Riak::MapReduce.new(client).
  add("posts", [["tokenize", "-", 1],
                ["string_to_int"],
                ["eq", 2011]])
```

Ripple::Document Improvements

Ripple::Document models got a lot of small improvements, including:

  • Callback ordering was fixed.
  • Documents can be serialized to JSON e.g. for API responses.
  • Client errors bubble up when saving a Document.
  • Several association proxy bugs were fixed.
  • The datetime serialization format defaults to ISO8601 but is also configurable.
  • Mass-attribute-assignment protection was added, including protecting :key by default.
  • Embedded documents can be compared for equality, which amounts to attribute equality when under the same parent document.
  • Documents can now have observer classes which can also be generated by the ripple:observer generator.

Testing Improvements

In order to make sure that the client layer is sufficiently independent of transport semantics and that the lower layers comply with the “unified” backend API, there is a new suite of integration tests for riak-client that covers operations that are supported by both transport mechanisms. This should make it much easier to implement new client backends in the future.

The Riak::TestServer was made faster and more reliable by a few changes to the Erlang bits that power it.

Onward to 1.0

Recently, the committers and some members of the community joined me in discussing some key features that need to be in Ripple before it reaches “1.0” status. Some of them will be really incredible, and I’m anxious to get started on them:

  • Enhanced Document querying (scopes, indexing, lazy loading, etc)
  • User-defined sibling-resolution policies, with automatic retries
  • Enhanced Riak Search features
  • Platform-specific Protocol Buffers drivers (MRI C, JRuby, Rubinius C++)
  • Server templates for creating development clusters (extracted from Riak::TestServer)

To that end, I’ve created a 0.9-stable branch which will only receive bugfixes going forward. All new development for 1.0 will be done on master. We’re likely to break some legacy APIs, so we will try to add deprecation notices to the 0.9 series where possible.

Enjoy this latest release of Ripple!

Sean and contributors

Why MapReduce is Easy

March 30, 2011

There’s something about MapReduce that makes it seem rather scary. It almost has this Big Data aura surrounding it, making it seem like it should only be used to analyze a large amount of data in a distributed fashion. It’s one of the pieces that makes Riak a pretty versatile key-value store: feed a bunch of keys into it and do some analytics on the objects. Quite handy.

But when you narrow it down to just the basics, MapReduce is pretty simple. I’m almost 100% certain that you’ve used it in one way or another in an application you’ve written. So before we go all distributed, let’s break MapReduce down into something small that you can use every day. That certainly has helped me understand it much better.

For our webinar on Riak and Node.js we built a little application with Node.js and Riak Search to store and search syslog messages. It’s called Riaktant and handily converts and stores syslog messages in a way that’s friendlier for both Riak Search and MapReduce. We’ll base this on examples we used in building the application.

MapReduce is easy because it works on simple data

MapReduce loves simple data structures. Why? Because when there are no deep, nested relationships between, say, objects, distributing data for parallel processing is a breeze. But I’m getting a little ahead of myself.

Let’s take the data Riaktant stores in Riak and see how easy it is to sift through it without even having to go distributed. It uses a JavaScript library called glossy to parse a syslog message and turn it into this nice JSON data structure.

```javascript
message = {
  "originalMessage": "<35>1 2011-02-14T11:10:25.137+01:00 lb1.basho.com ftpd 7003 - Client disconnected",
  "time": "2011-02-14T10:10:25.137Z",
  "severityID": 3,
  "facility": "auth",
  "version": 1,
  "prival": 35,
  "host": "lb1.basho.com",
  "facilityID": 4,
  "message": "7003 - Client disconnected",
  "severity": "err"
}
```

MapReduce is easy because you use it every day

I’m almost 100% certain you use MapReduce every day. If not daily, then at least once a week. Whenever you have a list of items that you loop or iterate over and transform into something else one by one, if only to extract a single attribute, there’s your map function.

Keeping with JavaScript, here’s how you’d extract the host from the above JSON, for a whole list:

```javascript
messages = [message];

messages.map(function(message) {
  return message.host
})
```

Or, if you insist, here’s the Ruby equivalent:

```ruby
messages.map do |message|
  message[:host]
end
```

If you must ask, here’s Python, using a list comprehension, for added functional programming sugar:

```python
[message['host'] for message in messages]
```

There, so simple, right? You’re halfway to some full-fledged MapReduce action.

MapReduce is easy because it’s just code

Before we continue, let’s add another syslog message.

```javascript
message2 = {
  "originalMessage": "<35>1 2011-02-14T11:10:25.137+01:00 web2.basho.com ftpd 7003 - Client disconnected",
  "time": "2011-02-14T10:12:37.137Z",
  "severityID": 3,
  "facility": "http",
  "version": 1,
  "prival": 35,
  "host": "web2.basho.com",
  "facilityID": 4,
  "message": "7003 - Client disconnected",
  "severity": "warn"
}
messages.push(message2)
```

We can take the above example even further (still using JavaScript), and perform some additional operations like result sorting, for example.

```javascript
messages.map(function(message) {
  return message.host
}).sort()
```

This gives us a nice sorted list of hosts. Coincidentally, sorting happens to be the second step in traditional MapReduce. Isn’t it nice how easily this is coming together?

The third and last step involves, you guessed it, more code. I don’t know about you, but I love things that involve code. Let’s reduce the list of hosts and count the occurrences of each host (if this reminds you of an SQL query that involves GROUP BY, you’re right on track).

```javascript
var reduce = function(total, host) {
  if (host in total) {
    total[host] += 1
  } else {
    total[host] = 1
  }
  return total
}

messages.map(function(message) {
  return message.host
}).sort().reduce(reduce, {})
```

There’s one tiny bit missing for this to be as close to MapReduce as we can get without going distributed. We need to slice up the list before we hand it to the map function. As JavaScript doesn’t have a built-in function to partition a list, we’ll whip up our own real quick. After all, we’ve come this far.

```javascript
function chunk(list, chunkSize) {
  for(var position, i = 0, chunk = -1, chunks = []; i < list.length; i++) {
    if (position = i % chunkSize) {
      chunks[chunk][position] = list[i]
    } else {
      chunk++;
      chunks[chunk] = [list[i]]
    }
  }
  return chunks;
}
```

It loops through the list, splitting it up into equally sized chunks, returning them neatly wrapped in a list.

Now we can chunk the initial list of messages, and boom, we have our own little MapReduce going, without magic, just code. Let’s put the new chunk function to good use.

```javascript
var mapResults = [];
chunk(messages, 2).forEach(function(chunk) {
  var messages = chunk.map(function(message) {
    return message.host
  })
  mapResults = mapResults.concat(messages)
})
mapResults.sort().reduce(reduce, {})
```

We split the messages into chunks of two, run the map function on each chunk, and collect the results as we go. Then we sort the results and feed them into the reduce function. That’s MapReduce in eight lines of JavaScript code. Easy, right?

That’s all there is to MapReduce. You use it every day, whether you’re aware of it or not. It works nicely with simple data structures, and it’s just code.

Unfortunately, things get complicated as soon as you go distributed, for example in a Riak cluster. But we’ll save that for the next post, where we’ll examine why MapReduce is hard.

Mathias

Riak and Scala at Yammer

March 28, 2011

What’s the best way to start off the week? With an awesome presentation from some very talented engineers about building a Riak-backed service.

This video, which runs about 40 minutes, was recorded last week at the San Francisco Riak Meetup and is worth every minute of your time. Coda Hale and Ryan Kennedy of Yammer give an excellent and in-depth look into how they built “Streamie”, why Riak was the right choice, and the lessons learned in the process.

Also:

  • The PDF of the slide presentation can be viewed here
  • Around the five-minute mark, Coda references a paper called “The Declarative Imperative: Experiences and Conjectures in Distributed Logic.”
  • If you are interested in talking about your Riak usage, get in touch with mark@basho.com and we’ll get the word out.

Enjoy.

Mark

There weren’t too many questions asked at the end of the presentation, so we decided to cut them out of the recording in the interest of time. Apologies for this. Here they are:

  • What local storage backend are you using? Bitcask.
  • How many keys are you currently storing? Around 5 million.
  • What is the average value size? Under 10K.
  • Can you share your hardware specs? Bare-metal, standard servers: 8-core, 16GB RAM, SATA drives.

Riak and Scala at Yammer from Basho Technologies on Vimeo.

Follow Up To Riak and Node.js Webinar

March 18, 2011

Thanks to all who attended Wednesday’s webinar on Riak (Search) and Node.js. If you couldn’t make it, you can find a screencast of the webinar below. You can also check out the slides directly.

We hope our little syslog-emulating sample application, Riaktant, gave you a good idea of what you can do with the winning combination of Riak and Node.js. We made the source code available, so if you feel like running your own syslog replacement, go right ahead and let us know how things go. Of course, you can also just dig into the code and see how nicely Node.js and Riak play together.

If you want some practical ideas about how we used Riak’s MapReduce to analyze the log data, have a look at the functions used by the web interface. You can throw these right into the Node.js console and try them out yourself: riak-js, the Node.js client for Riak, accepts plain JavaScript functions, so you don’t have to serialize them into strings.
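For instance, counting hosts across stored messages could look something like the following minimal sketch. It assumes riak-js’s chainable add/map/run interface and a hypothetical syslog bucket; check the riak-js documentation for the exact callback signature:

```javascript
var db = require('riak-js').getClient();

// The map phase is a real JavaScript function, not a serialized string.
// Riak.mapValuesJson is provided by Riak's JavaScript VM, not by Node.js.
db.add('syslog')
  .map(function(value) {
    return [Riak.mapValuesJson(value)[0].host];
  })
  .run(function(err, results) {
    if (err) throw err;
    console.log(results);
  });
```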

Thanks to Joyent for providing us with SmartMachines running Riak, and for offering No.de, their great hosting service for Node.js applications, where we deployed our little app with great ease.

Sean and Mathias

Free Webinar – Riak with Node.js – March 15 @ 2PM Eastern

March 8, 2011

JavaScript is the lingua franca of the web, and many developers are starting to use node.js to power their server-side applications. Riak is a flexible, scalable database that has a JavaScript-friendly interface, including MapReduce in JavaScript and an awesome client library called riak-js. Put the two together and you have lots of possibilities!

We invite you to join us for a free webinar on Tuesday, March 15 at 2:00PM Eastern Time (UTC-4) to talk about Riak with node.js. In this webinar, we’ll discuss:

  • Getting riak-js, the Riak client for node.js, into your application
  • Storing, retrieving, manipulating key-value data
  • Issuing MapReduce queries
  • Finding data with Riak Search
  • Testing your code with the TestServer

We’ll address the above topics in addition to looking at a sample application. The presentation will last 30 to 45 minutes, with time for questions at the end. Fill in the form below if you want to get started building node.js applications on top of Riak!

KillDashNine March Happening on Wednesday

March 5, 2011

In February we kicked off the KillDashNine drinkup. It was a huge success (turns out we aren’t the only ones who care about durability) and, as promised, we’ll be having another drinkup this month. On Wednesday, 3/9, we will be clinking glasses and sharing data loss horror stories at Bloodhound, located at 1145 Folsom Street here in San Francisco.

This month’s chosen cocktail is the *Data Eraser*, and it’s simple to make: 2 oz vodka, 2 oz coffee liqueur, 2 oz tonic, and a dash of bitter frustration, anguish, and confusion (which is more or less how one feels when their data just disappears). And if you can’t make it, be sure to pour yourself a Data Eraser on 3/9 to take part in the festivities from wherever you happen to find yourself (or run your own local KillDashNine like Marten Gustafson did in Stockholm last month).

Registration details for the event are here, so be sure to RSVP if you’re planning to join us. In the meantime, spin up a few nodes of your favorite database and try your hand at terminating some processes with the help of our favorite command: _kill -9_.

Long Live Durability!

Basho


Announcing KevBurnsJr as a PHP Client Committer

February 28, 2011

We just added Kevin Burns, who goes by KevBurnsJr on GitHub, Twitter and the #riak IRC room on irc.freenode.net, as a committer to the Basho-supported Riak PHP Client.

Kevin has been hard at work over the past few weeks adding some great functionality to the PHP client and has even kicked off porting Ripple, Basho’s Ruby Client ODM, to PHP. Suffice it to say that we at Basho are excited about Kevin’s participation and involvement with Riak and our PHP code.

Some relevant code:

  • Riak’s PHP Client
  • Port of Ripple to PHP

Thanks, Kev! We are looking forward to your contributions.

Mark

MapReducing Big Data With Luwak Webinar

February 14, 2011

Basho Senior Engineer Bryan Fink has been doing some exceptional work with MapReduce and Luwak, Riak’s large-object storage interface. Recently, he wrote up two extensive blog posts on the specifics of Luwak and the powerful tool it makes when combined with Riak’s MapReduce engine:

We’ve seen a huge amount of Luwak usage since its release and, since these blog posts went up, a large amount of interest in running MapReduce queries over data stored in Riak via Luwak. So we thought, what better way to spread the word than through a free webinar?

This Thursday, February 17th at 2PM EST, Bryan will be leading the MapReducing Big Data With Luwak Webinar. The planned agenda is as follows:

  • Overview of Riak MapReduce and its typical usage
  • Gotchas and troubleshooting
  • Usage Recommendations and Best Practices
  • An Introduction to Luwak, Riak’s Large File Storage Interface
  • Luwak MapReduce in Action

Registration is now closed.

Hope to see you there.

The Basho Team


Creating a Local Riak Cluster with Vagrant and Chef

February 4, 2011

The “Riak Fast Track” has been around for at least nine months now, and lots of developers have gotten to know Riak that way, building their own local clusters from the Riak source. But there’s always been something about that process that has bothered me: namely, that the developer has to build Riak herself. Basho provides pre-built packages on downloads.basho.com for several Linux distributions, Solaris, and Mac OS X, but these have the limitation of only letting you run one node on a machine.

I’ve been a long-time fan of Chef, the systems and configuration management tool by Opscode, especially for the wealth of community recipes and vibrant participation. It’s also incredibly easy to get started with small Chef deployments using Opscode’s Platform, which is free for up to 5 managed machines.

Anyway, as part of updating Riak’s Chef recipe last month to work with the 0.14.0 release, I discovered that the easiest way to test the recipe — without incurring the costs of Amazon EC2 — was to deploy local virtual machines with Vagrant. So this blog post will be a tutorial on how to create your own local 3-node Riak cluster with Chef and Vagrant, suitable for doing the rest of the Fast Track.

Before we start, I’d like to thank Joshua Timberman and Seth Chisamore from Opscode who helped me immensely in preparing this.

Step 1: Install VirtualBox

Under the covers, Vagrant uses VirtualBox, a free virtualization product originally created at Sun. Go ahead and download and install the version appropriate for your platform:

Step 2: Install Vagrant and Chef

Now that we have VirtualBox installed, let’s get Vagrant and Chef. You’ll need Ruby and Rubygems installed for this. Mac OS X comes with these pre-installed, but they’re easy to get on most platforms.
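Both tools install as Rubygems; a minimal sketch (exact versions will vary):

```
gem install vagrant
gem install chef
```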

Now that you’ve got them both installed, you need to get a virtual machine image to run Riak from. Luckily, Opscode has provided some images for us that have the 0.9.12 Chef gems preinstalled. Download the Ubuntu 10.04 image and add it to your local collection:

Step 3: Configure Local Chef

Head on over to Opscode and sign up for a free Platform account if you haven’t already. This gives you access to the cookbooks site as well as the Chef admin UI. Make sure to collect your “knife config” and “validation key” from the “Organizations” page of the admin UI, and your personal “private key” from your profile page. These help you connect your local working space to the server.

Now let’s get our Chef workspace set up. You need a directory that has specific files and subdirectories in it, also known as a “Chef repository”. Again, Opscode has made this easy on us; we can just clone their skeleton repository:
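A sketch of that step, assuming the skeleton is Opscode’s chef-repo project on GitHub:

```
git clone git://github.com/opscode/chef-repo.git
cd chef-repo
```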

Now let’s put the canonical Opscode cookbooks (including the Riak one) in our repository:

Finally, put the Platform credentials we downloaded above inside the repository (the .pem files will be named differently for you):

Step 4: Configure Chef Server

Now we’re going to prep the Chef Server (provided by Opscode Platform) to serve out the recipes needed by our local cluster nodes. The first step is to upload the two cookbooks we need using the *knife* command-line tool, sketched below. I’ve left out the output since it can get long.
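Something like the following, assuming the riak and iptables cookbooks are sitting in our repository’s cookbooks directory:

```
knife cookbook upload iptables riak
```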

Then we’ll create a “role” — essentially a collection of recipes and attributes — that will represent our local cluster nodes, and call it “riak-vagrant”. Running knife role create riak-vagrant will open your configured EDITOR (mine happens to be emacs) with the JSON representation of the role. The role will be posted to the Chef server when you save and close your editor.

The key things to note in the role below are the “run list” and “override attributes” sections. The “run list” says which recipes to execute on a machine that receives the role. We configure iptables to run with Riak, and of course the relevant Riak recipes. The “override attributes” change default settings that come with the cookbooks. I’ve put explanations inline, but to summarize: we want to bind Riak to all network interfaces, and put it in a cluster named “vagrant”, which will be used by the “riak::autoconf” recipe to automatically join our nodes together.
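As a sketch, the role’s JSON might look roughly like this; the run list entries are the ones named above, but the attribute keys depend on the cookbook version, so treat them as illustrative rather than authoritative:

```
{
  "name": "riak-vagrant",
  "description": "Nodes of the local Vagrant-based Riak cluster",
  "run_list": [
    "recipe[iptables]",
    "recipe[riak]",
    "recipe[riak::autoconf]"
  ],
  "override_attributes": {
    "riak": {
      "core": { "web_ip": "0.0.0.0" },
      "kv": { "pb_ip": "0.0.0.0" },
      "cluster_name": "vagrant"
    }
  }
}
```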

Step 5: Setup Vagrant VM

Now that we’re ready on the Chef side of things, let’s get Vagrant going. Make three directories inside your Chef repository called dev1, dev2, and dev3, just like in the Fast Track. Change directory into dev1 and run vagrant init. This will create a Vagrantfile, which you should edit to look like this one (explanations inline again):
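The full Vagrantfile from the original post is longer (the line numbers in the table under Step 7 refer to that file); this is just a sketch of the important parts, assuming Vagrant 0.7-era syntax and illustrative box name and attribute paths:

```ruby
Vagrant::Config.run do |config|
  # The Chef-ready Ubuntu 10.04 box added in Step 2 (name is illustrative).
  config.vm.box = "chef-ubuntu-10.04"

  # A unique, static IP on a host-only network.
  config.vm.network "33.33.33.11"

  # Forward Riak's HTTP (8098) and Protocol Buffers (8087) ports to the host.
  config.vm.forward_port "riak-http", 8098, 8091
  config.vm.forward_port "riak-pbc",  8087, 8081

  # Provision with the Opscode Platform (hosted Chef server).
  config.vm.provision :chef_client do |chef|
    chef.chef_server_url = "https://api.opscode.com/organizations/ORGNAME"
    chef.validation_client_name = "ORGNAME-validator"
    chef.validation_key_path = "ORGNAME-validator.pem"
    chef.node_name = "riak-fast-track-1"
    chef.run_list = ["role[riak-vagrant]"]
    # Per-node override: the Riak node name must match the IP
    # (the attribute path is illustrative; see the riak cookbook).
    chef.json = { "riak" => { "erlang" => { "node_name" => "riak@33.33.33.11" } } }
  end
end
```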

Remember: change any place where it says ORGNAME to match your Opscode Platform organization.

Step 6: Start up dev1

Now we’re ready to see if all our preparation has paid off:

If the Chef run completes successfully at the end of the output, it worked! If it doesn’t work the first time, try running vagrant provision from the command line to invoke Chef again. Let’s see if our Riak node is functional:
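A quick sanity check, assuming the HTTP port forwarding sketched above (host port 8091) and Riak’s standard ping resource:

```
curl -i http://localhost:8091/ping
# Expect an HTTP 200 response with the body "OK"
```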

Awesome!

Step 7: Repeat with dev2, dev3

Now let’s get the other nodes set up. Since we’ve done the hard parts already, we just need to copy the Vagrantfile from dev1/ into the other two directories and modify them slightly.

The easiest way to describe the modifications is in a table:

| Line | dev2 | dev3 | Explanation |
|------|------|------|-------------|
| 7 | "33.33.33.12" | "33.33.33.13" | Unique IP addresses |
| 11 (last number) | 8092 | 8093 | HTTP port forwarding |
| 12 (last number) | 8082 | 8083 | PBC port forwarding |
| 40 | "riak-fast-track-2" | "riak-fast-track-3" | Unique Chef node name |
| 48 | "riak@33.33.33.12" | "riak@33.33.33.13" | Unique Riak node name |

With those modified, start up dev2 (run vagrant up inside dev2/) and watch it connect to the cluster automatically. Then repeat with dev3 and enjoy your local Riak cluster!

Conclusions

Beyond just being a demonstration of cool technology like Chef and Vagrant, you’ve now got a developer setup that is isolated and reproducible. If one of the VMs gets too messed up, you can easily recreate the whole cluster. It’s also easy to get new developers in your organization started using Riak since all they have to do is boot up some virtual machines that automatically configure themselves. This Chef configuration, slightly modified, could later be used to launch staging and production clusters on other hardware (including cloud providers). All in all, it’s a great tool to have in your toolbelt.

Sean