Tag Archives: MapReduce

MapReducing Big Data With Luwak Webinar

February 14, 2011

Basho Senior Engineer Bryan Fink has been doing some exceptional work with MapReduce and Luwak, Riak’s large-object storage interface. Recently, he wrote up two extensive blog posts on the specifics of Luwak and the powerful tool it becomes when combined with Riak’s MapReduce engine.

We’ve seen a huge amount of Luwak usage since its release and, since these blog posts, a large amount of interest in running MapReduce queries over data stored in Riak via Luwak. So, we thought what better way to spread the word than through a free Webinar?

This Thursday, February 17th at 2PM EST, Bryan will be leading the MapReducing Big Data With Luwak Webinar. The planned agenda is as follows:

  • Overview of Riak MapReduce and its typical usage
  • Gotchas and troubleshooting
  • Usage Recommendations and Best Practices
  • An Introduction to Luwak, Riak’s Large File Storage Interface
  • Luwak MapReduce in Action

Registration is now closed.

Hope to see you there.

The Basho Team


Baseball Batting Average, Using Riak Map/Reduce

January 20, 2011

A few days ago, I announced a tool that I assembled last weekend, called luwak_mr. That tool extends Riak’s map/reduce functionality to “Luwak” files.

But what does that mean? What can it do?

Luwak is a tree-based block-storage library for Riak. Basically, you feed Luwak a large binary, and it splits the binary into chunks, and creates a tree representing how those chunks fit together. Each chunk (or “block”) is stored as a separate value in Riak, and the tree structure is stored under whatever “filename” you give it. Among other things, this allows for much more efficient access to ranges of the binary (in comparison to storing the entire binary as one value in Riak, forcing it to be read and written in its entirety).
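As a concrete sketch of that interface: the function names below come from the Luwak source (luwak_file:create, luwak_io:put_range, luwak_io:get_range), but treat the exact signatures and return shapes as assumptions rather than gospel.

```erlang
%% A minimal sketch: store a large binary under a "filename", then
%% read a byte range. Function names come from the Luwak source, but
%% exact signatures and return shapes are assumptions.
{ok, Client} = riak:local_client(),
{ok, File} = luwak_file:create(Client, <<"big.bin">>, dict:new()),
BigBinary = binary:copy(<<0>>, 10*1024*1024),   %% 10MB of zeros
luwak_io:put_range(Client, File, 0, BigBinary), %% split into blocks
%% Fetch bytes 5000..5999 -- only the covering block is read:
Chunk = luwak_io:get_range(Client, File, 5000, 1000).
```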

The luwak_mr tool allows you to easily feed a chunked Luwak file into Riak’s map/reduce system. It will do this in such a way as to provide each chunk for map processing, individually. For example, if you had a Luwak file named “foo” made of ten blocks, the following map/reduce request would evaluate the “BarFun” function ten times (once for each block):

```erlang
C:mapred({modfun, luwak_mr, file, <<"foo">>},
         [{map, BarFun, none, true}]).
```

So what’s that good for?

Partitioning distributed work is the boon of Luwak+luwak_mr. If you’re using a multi-node Riak cluster, Luwak has done the work of spreading pieces of your large binary across all of your nodes. The luwak_mr tool allows you to capitalize on that distribution by using Riak’s map/reduce system to analyze those pieces, in parallel, on the nodes where the pieces are stored.

How about a more concrete example? The common one is distributed grep, but I find that a little boring and contrived. How about something more fun … like baseball statistics. [1] [2]

I’ll use Retrosheet’s Play-by-Play Event Files as input. Specifically, I’ll use the regular season, by decade, 1950-1959. If you’d like to follow along, download “1950seve.zip” and unzip it to a directory called “1950s”.

If you look at one of those files, say “1950BOS.EVA”, you’ll see that each event is a line of comma-separated values. I’m interested in the “play” records for this computation. The first one in that file is on line 52:

```text
play,1,0,rizzp101,??,,K
```

This says that in the first inning (1), the away-team (0) player “Phil Rizzuto” (rizzp101) struck out (K). For the purposes of the batting average calculation, this is one at-bat, no hit.
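To make the field layout concrete, here’s a small hypothetical helper (not part of luwak_mr or the baseball module below) that splits one of these records into its parts:

```erlang
%% Hypothetical helper: split a "play" record into its six fields --
%% inning, home/away flag, player id, count, pitches, and event.
%% Assumes the event field contains no commas, as in these files.
parse_play(<<"play,", Rest/binary>>) ->
    [Inning, HomeAway, PlayerId, Count, Pitches, Event] =
        binary:split(Rest, <<",">>, [global]),
    {Inning, HomeAway, PlayerId, Count, Pitches, Event}.
```

So `parse_play(<<"play,1,0,rizzp101,??,,K">>)` would yield `{<<"1">>, <<"0">>, <<"rizzp101">>, <<"??">>, <<>>, <<"K">>}`.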

Using grep [3], I can find all of Phil’s “plays” in the 1950s like so:

```bash
$ grep -e play,.,.,rizzp101 *.EV*
1950BOS.EVA:play,1,0,rizzp101,??,,K
1950BOS.EVA:play,3,0,rizzp101,??,,53
1950BOS.EVA:play,5,0,rizzp101,??,,6
…snip (3224 lines total)…
```

What I need to do is pile these plays into two categories: those that designate an “at bat,” and those that designate a “hit.” That’s easily done with some extra regular expressions and a little counting:

```bash
$ grep -E "play,.,.,rizzp101,.*,.*,(S[0-9]|D[0-9]|T[0-9]|H([^P]|$))" *.EV* | wc -l
562
$ grep -E "play,.,.,rizzp101,.*,.*,(NP|BK|CS|DI|OA|PB|WP|PO|SB|I?W|HP|SH)" *.EV* | wc -l
728
```

The result of the first grep is the number of hits (singles, doubles, triples, home runs) found (562). The result of the second grep is the number of non-at-bat plays (substitutions, base steals, walks, etc.; 728); if I subtract it from the total number of plays (3224), I get the number of at-bats (2496). Phil’s batting average is 562 (hits) / 2496 (at-bats) (x1000), or 225.
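Spelled out in the Erlang shell (pure arithmetic, nothing Riak-specific):

```text
1> AtBats = 3224 - 728.   %% total plays minus non-at-bat plays
2496
2> trunc(562 / AtBats * 1000).   %% hits per at-bat, x1000
225
```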

Great, so now let’s parallelize. The first thing I’ll do is get the data stored in Riak. That’s as simple as attaching to any node’s console and running this function:

```erlang
load_events(Directory) ->
    true = filelib:is_dir(Directory),
    Name = iolist_to_binary(filename:basename(Directory)),
    {ok, Client} = riak:local_client(),
    {ok, LuwakFile} = luwak_file:create(Client, Name, dict:new()),
    LuwakStream = luwak_put_stream:start_link(Client, LuwakFile, 0, 5000),
    filelib:fold_files(Directory,
                       ".*.EV?", %% only events files
                       false,    %% non-recursive
                       fun load_events_fold/2,
                       LuwakStream),
    luwak_put_stream:close(LuwakStream),
    ok.

load_events_fold(File, LuwakStream) ->
    {ok, FileData} = file:read_file(File),
    luwak_put_stream:send(LuwakStream, FileData),
    LuwakStream.
```

I’ve put this code in a module named “baseball”, so running it is as simple as:

```text
(riak@10.0.0.1) 1> baseball:load_events("/home/bryan/baseball/1950s").
```

This will create one large Luwak file (approximately 48MB) named “1950s” by concatenating all 160 event files. Default Luwak settings are for 1MB blocks, so I’ll have 48 of them linked from my tree.

Mapping those blocks is quite simple. All I have to do is count the hits and at-bats for each block. The code to do so looks like this:

```erlang
ba_map(LuwakBlock, _, PlayerId) ->
    Data = luwak_block:data(LuwakBlock),
    [count_at_bats(Data, PlayerId)].

count_at_bats(Data, PlayerId) ->
    Re = [<<"^play,.,.,">>, PlayerId, <<",.*,.*,(.*)$">>],
    case re:run(Data, iolist_to_binary(Re),
                [{capture, all_but_first, binary},
                 global, multiline, {newline, crlf}]) of
        {match, Plays} ->
            lists:foldl(fun count_at_bats_fold/2, {0,0}, Plays);
        nomatch ->
            {0, 0}
    end.

count_at_bats_fold([Event], {Hits, AtBats}) ->
    {case is_hit(Event) of
         true -> Hits+1;
         false -> Hits
     end,
     case is_at_bat(Event) of
         true -> AtBats+1;
         false -> AtBats
     end}.

is_hit(Event) ->
    match == re:run(Event,
                    "^("
                    "S[0-9]"     % single
                    "|D[0-9]"    % double
                    "|T[0-9]"    % triple
                    "|H([^P]|$)" % home run
                    ")",
                    [{capture, none}]).

is_at_bat(Event) ->
    nomatch == re:run(Event,
                      "^("
                      "NP"   % no-play
                      "|BK"  % balk
                      "|CS"  % caught stealing
                      "|DI"  % defensive interference
                      "|OA"  % base runner advance
                      "|PB"  % passed ball
                      "|WP"  % wild pitch
                      "|PO"  % picked off
                      "|SB"  % stole base
                      "|I?W" % walk
                      "|HP"  % hit by pitch
                      "|SH"  % sacrifice (bunt)
                      ")",
                      [{capture, none}]).
```

When the ba_map/3 function runs on a block, it produces a 2-element tuple. The first element of that tuple is the number of hits in the block, and the second is the number of at-bats. Combining them is even easier:

```erlang
ba_reduce(Counts, _) ->
    {HitList, AtBatList} = lists:unzip(Counts),
    [{lists:sum(HitList), lists:sum(AtBatList)}].
```

The ba_reduce/2 function expects a list of tuples produced by map function evaluations. It produces a single 2-element tuple whose first element is the sum of the first elements of all of the inputs (the total hits), and whose second element is the sum of the second elements (the total at-bats).
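For example, assuming ba_reduce/2 is exported, combining the counts from three hypothetical blocks at the console looks like this:

```text
(riak@10.0.0.1) 3> baseball:ba_reduce([{2,10}, {1,8}, {0,5}], none).
[{3,23}]
```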

These functions live in the same baseball module, so using them is simple:

```erlang
Client:mapred({modfun, luwak_mr, file, Filename},
              [{map, {modfun, baseball, ba_map}, PlayerId, false},
               {reduce, {modfun, baseball, ba_reduce}, none, true}]).
```

I’ve exposed that call as the batting_average/2 function, so finding Phil Rizzuto’s batting average in the 1950s is as simple as typing at the Riak console:

```text
(riak@10.0.0.1) 2> baseball:batting_average(<<"1950s">>, <<"rizzp101">>).
225
```
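For completeness, here’s a minimal sketch of what that batting_average/2 wrapper might look like; the local-client call and the final rounding are my assumptions around the mapred call shown above:

```erlang
%% A sketch of batting_average/2 (assumed shape): run the map/reduce
%% job above and turn the {Hits, AtBats} total into a three-digit
%% batting average.
batting_average(Filename, PlayerId) ->
    {ok, Client} = riak:local_client(),
    {ok, [{Hits, AtBats}]} =
        Client:mapred({modfun, luwak_mr, file, Filename},
                      [{map, {modfun, baseball, ba_map}, PlayerId, false},
                       {reduce, {modfun, baseball, ba_reduce}, none, true}]),
    trunc(1000 * Hits / AtBats).
```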

Tada! Parallel processing power! But, you couldn’t possibly let me get away without a micro-benchmark, could you? Here’s what I saw:

| Environment | Time |
| --- | --- |
| grep [4] | 0.060 + 0.081 + 0.074 = 0.215s (0.002*3 = 0.006s cached) |
| Riak, 1 node [5] | 0.307s (0.012s cached) |
| Riak, 4 nodes | 0.163s (0.024s cached) |

All of the disclaimers about micro-benchmarks apply: disc caches play games, opening and closing files takes time, this isn’t a large enough dataset to highlight the really interesting cases, etc. But, I’m fairly certain that these numbers show two things. The first is that since the Riak times aren’t orders of magnitude off of the grep times, the Riak approach is not fundamentally flawed. The second is that since the amount of time decreases with added nodes, some parallelism is being exploited.

There is, of course, at least one flaw in this specific implementation, and also several ways to improve it. Anyone want to pontificate?

-Bryan

[1] Astute readers will note that this is really distributed grep, entangled with some processing, but at least it gives more interesting input and output data.

[2] Forgive me if I botch these calculations. I stopped playing baseball when the coach was still pitching. The closest I’ve come since is intramural softball. But, it’s all numbers, and I can handle numbers. Wikipedia has been my guide, for better or worse.

[3] Okay, maybe not so much astuteness is necessary.

[4] My method for acquiring the grep stat was, admittedly, lame. Specifically, I ran the grep commands as given above, using the “time” utility (i.e. “time grep -E …”). I added the time for the three runs together, and estimated the time for the final sum and division at 0.

[5] Timing the map/reduce was simple, using timer:tc(baseball, batting_average, [<<"1950s">>, <<"rizzp101">>]).

A Deeper Look At Riak's MapReduce Enhancements

January 6, 2011

We officially released Riak 0.14 yesterday. Some of the biggest enhancements were in and around Riak’s MapReduce functionality. Here’s a more in-depth look at what you can look forward to in 0.14 if you’re into Mapping and Reducing.

Key Filtering

Performing any type of sophisticated query on a strictly key/value store is notoriously hard. Past releases of Riak were limited to MapReduce-ing over either entire buckets or a discrete set of user-supplied inputs. The problem with these approaches is that neither facilitates the kind of robust querying many applications require. For example, let’s examine an application which logs application events to a Riak bucket. Each entry is a JSON hash containing a timestamp, the user generating the event, and some information about the event. Part of our example application requires querying these log entries based on timestamp and user.

In earlier releases of Riak, the application would have to map over the entire bucket and examine each entry to find the relevant set. This type of query is usable when the bucket is small, but as the bucket grows these queries begin to exhibit performance problems. Scanning the bucket and loading objects from disk only to discard them is an expensive proposition.

This is exactly the use case where key filtering can help. If the application can store meaningful data in the keys then key filtering can query just the keys and load only the objects whose keys pass the filter to be processed by the MapReduce job. For our example app we could combine the timestamp and user id to form each entry’s key like this: “1292943572.pjohnson”. Using key filters we can locate all the user entries for “pjohnson” between “12/1/2010” and “12/7/2010” and count them via MapReduce:

```javascript
{"inputs": {
   "bucket": "user_logs",
   "key_filters": [["and", [["tokenize", ".", 1],
                            ["string_to_int"],
                            ["between", 1291161600, 1291852799]],
                           [["tokenize", ".", 2],
                            ["matches", "pjohnson"]]]]},
 "query": [{"map": {
              "language": "javascript",
              "source": "function(obj) { return [1]; }"}},
           {"reduce": {
              "language": "javascript",
              "name": "Riak.reduceSum"}}]}
```

Key filtering will support boolean operators (and, or, not), url decoding, string tokenizing, regular expressions, and various string to numeric conversions. Client support will initially be limited to the Java and Ruby clients (and Python support is already being attacked by the community). More clients will be added in subsequent releases.


MapReduce Query Planner

One of the biggest obstacles to improving Riak’s MapReduce performance was the way map functions were scheduled around the cluster. The original implementation was fairly naive and scheduled map functions around the vnodes in the order listed in the replica list for each bucket/key pair. This approach resulted in vnode hotspots, especially in smaller clusters, as many bucket/key pairs would hash to the same vnode. We also sent each bucket/key pair to be mapped in a separate Erlang message which reduced throughput on larger jobs as they wound up generating significant messaging traffic.

The new planner addresses many of these problems. Each batch of 50 bucket/key pairs are analyzed and scheduled around the cluster to maximize vnode coverage. In other words, the planner schedules many bucket/key pairs onto a common vnode in a single message. This reduces the chattiness of jobs overall and also improves throughput as the underlying map dispatch code can operate in batches rather than single values.

Segregated Javascript VM Pools

Contention for Javascript VMs in a busy cluster can be a significant source of performance problems. The contention is caused by each cluster node having a single pool of Javascript VMs for all Javascript calls: map functions, reduce functions, and pre-commit hooks.

0.14 supports three separate pools of Javascript VMs to reduce overall contention. By tweaking a few lines of code in your app.config file, users will be able to tailor the size of each pool to their particular needs. Does your app use MapReduce and ignore hooks? Turn the hook pool size down to zero and save yourself some CPU and memory. Do you always submit MapReduce jobs to a particular node in the cluster? You can bump up the reduce pool size on the node receiving the jobs while setting it to zero on the other nodes. This takes advantage of the fact that reduce phases aren’t distributed, letting you put resources where they are most needed in the cluster.
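As a rough sketch, the pools are sized in the riak_kv section of app.config; the counts below are illustrative, and the exact setting names should be checked against the release notes:

```erlang
%% riak_kv section of app.config -- illustrative pool sizes.
{riak_kv, [
           {map_js_vm_count, 8},    %% VMs reserved for map functions
           {reduce_js_vm_count, 6}, %% VMs reserved for reduce functions
           {hook_js_vm_count, 0}    %% VMs for pre-commit hooks (0 if unused)
          ]}
```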

As you can see, we’ve put a lot of work into refining MapReduce in the latest release, and we’re dedicated to continuing this work in upcoming releases. If you want to get your hands dirty with MapReduce right now, check out the examples above.

Enjoy!

The Basho Team

Riak 0.14 Released

January 5, 2011

Happy New Year, Happy Wednesday, and Happy Riak 0.14! It’s a new year and it’s time for a new version of Riak. We’ve been putting the final touches on the latest release candidate for the last few days and we are now ready to call Riak 0.14, a.k.a “Dakota,” official.

Here’s the rundown of the large improvements (for those of you who just want the release notes, stop reading and click here):

MapReduce Enhancements

As promised, we put some significant development time towards the robustness, performance, and stability of MapReduce in 0.14. There are three primary areas worth mentioning:

  1. Key Filtering – Until now, MapReduce jobs were only able to run over entire buckets or a specific set of user-supplied keys. This approach can result in some performance hiccups as your buckets and inputs grow. Key Filtering, which is new in this release, enables you to build meaningful data into your keys and then filter for a given set of keys before processing, thus focusing the inputs for the job and increasing performance. Key filtering will support boolean operators (and, or, not), url decoding, string tokenizing, regular expressions, and various string to numeric conversions.
  2. MapReduce Query Planner – Scheduling functions is hard. Our approach to several components in the scheduling process in previous Riak releases was less than optimal, so we’ve done a lot of work to refine the process. The new query planner batches each set of 50 bucket/key pairs that are then analyzed and scheduled around the cluster to maximize vnode coverage. This yielded a nice reduction in cluster chattiness while improving throughput. Win Win™.
  3. Segregated Javascript VM Pools – 0.14 will support three separate pools of Javascript VMs to reduce overall contention. Why three separate pools? For the three different JS calls: map functions, reduce functions, and pre-commit hooks. This fine-grained level of tweaking will let you better allocate resources and improve cluster performance.

This slide deck talks more about these three enhancements. And, there is a lengthier blog post coming out tomorrow dedicated to these MapReduce improvements…

Cluster and Node Debugging

The ability to monitor and debug a running Riak cluster received some substantial enhancements in 0.14. This is because Riak is now shipping with two new applications: riak_err and cluster_info. Basho hacker Scott Fritchie posted a blog back in November with an extensive overview of what these two applications will make possible when running Riak. Read that for all the details. The short version is that a) riak_err improves Riak’s runtime robustness by strictly limiting the amount of RAM that is used while processing event log messages and b) cluster_info assists troubleshooting by automatically gathering lots of environment, configuration, and runtime statistics data into a single file.

We’ve also added some new documentation to the wiki on the Command Line Tools page (at the bottom) with some more details on what cluster_info is all about and how to use it.

Windowed Merges for Bitcask

The default storage backend for Riak is Bitcask, and we are increasingly seeing users select this for their production clusters thanks to (among other things) its low, predictable latencies and high throughput. Bitcask saw numerous enhancements and bug fixes in 0.14, the most significant of which is something called “windowed merges.” Bitcask performs periodic merges over all non-active files to compact the space being occupied by old versions of stored data. In certain situations this can cause some memory and CPU spikes on the Riak node where the merge is taking place. To that end, we’ve added the ability to specify when Bitcask will perform merges. So, for instance, if you know that you typically see the lowest load on your cluster between 2 and 4 AM, you can set this time frame as your acceptable start and stop time for merges. This is set in your bitcask.app file.
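For example, a 2AM-4AM window might look like this in the bitcask section of your configuration (the setting name here is an assumption; check the release notes for the exact form):

```erlang
%% bitcask section of app.config -- allow merges only between 2AM and
%% 4AM local time (hours on a 24-hour clock; setting name assumed).
{bitcask, [
           {merge_window, {2, 4}}
          ]}
```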

Other Noteworthy Enhancements

Other noteworthy enhancements include support for HTTPS and multiple HTTP IPs, packaging scripts for building debs, rpms and Solaris packages, and the ability to list buckets through the REST API. Check out the release notes for a complete list of new features and bug fixes.

Contributors for 0.14

Aside from the core Basho Devs, here is the list[1] of people (in no particular order) who contributed code, bug fixes and other bits between 0.13 and 0.14 (across all the OTP apps that come bundled with Riak):

Tuncer Ayaz, Jebu Ittiachen, Ben Black, Jesper Louis Andersen, Fernando Benavides, Magnus Klaar, Mihai Balea, Joseph Wayne Norton, Anthony Ramine, David Reid, Benjamin Nortier, Alexey Romanov, Adam Kocoloski, Juhani Rankimies, Andrew Thompson, Misha Gorodnitzky, Daniel Néri, andekar, Kostis Sagonas, Phil Pirozhkov, Benjamin Bock, Peter Lemenkov.

Thanks for your contributions! Keep them coming.

1 – If I forgot or misspelled your name, email mark@basho.com and we’ll add/fix it ASAP.

Hey, what about Riak Search?!

We’ve got a few release-related loose ends to tie up with Riak Search. But don’t worry. This release was very significant for Search, and we’re shooting to have it tagged and released next week.

So what should you do now?

We’re already hard at work on the next release. We’re calling it “Elgin.” (Bonus Riak T shirt for anyone who can find the pattern behind the naming scheme; Dakota and Elgin should be enough info to go on.) If you want to get involved with Riak, join the mailing list or come hang out in the Riak channel on IRC to get your feet wet.

Other than that, thanks for using Riak!

The Basho Team

Webinar Recap and Q&A – Schema Design for Riak

December 8, 2010

Thank you to all who attended the webinar yesterday. The turnout was great, and the questions at the end were also very thoughtful. Since I didn’t get to answer very many live, I’ve answered all of the questions below, in no particular order.

Q: Can you touch on upcoming filtering of keys prior to map reduce? Will it essentially replace the need for one to explicitly name the bucket/key in a M/R job? Does it require a bucket list-keys operation?

Key filters, in the upcoming 0.14 release, will allow you to logically select a population of keys from a bucket before running them through MapReduce. This will be faster than a full-bucket map since it only loads the objects you’re really interested in (the ones that pass the filter). It’s a great way to make use of meaningful keys that have structure to them. So yes, it does require a list-keys operation, but it doesn’t replace the need to be explicit about which keys to select; there are still many useful queries that can be done when the keys are known ahead of time.

For more information on key-filters, see Kevin’s presentation on the upcoming MapReduce enhancements.

Q: How can you validate that you’ve reached a good/valid KV model when migrating a relational model?

The best way is to try out some models. The thing about schema design for Riak that turns your process on its head is that you design for optimizing queries, not for optimizing the data model. If your queries are efficient (single-key lookup as much as possible), you’ve probably reached a good model, but also weigh things like payload size, cost of updating, and difficulty manipulating the data in your application. If your design makes it substantially harder to build your application than a relational design, Riak may not be the right fit.

Q: Are there any “gotchas” when thinking of a bucket as we are used to thinking of a table?

Like tables, buckets can be used to group similar data together. However, buckets don’t automatically enforce data structure (columns with specified types, referential integrity) like relational tables do; that part is still up to your application. You can, however, add precommit hooks to buckets to perform any data validation that your application shouldn’t have to handle.
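For instance, here’s a minimal sketch of such a hook, assuming mochijson2 (which ships with Riak) for the JSON check; wiring it up to a bucket’s precommit property is left out:

```erlang
%% A pre-commit hook that rejects objects whose value is not valid
%% JSON. Returning the object allows the write; {fail, Reason}
%% aborts it.
validate_json(Object) ->
    try
        mochijson2:decode(riak_object:get_value(Object)),
        Object
    catch
        _:_ -> {fail, "invalid JSON"}
    end.
```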

Q: How would you create a ‘manual index’ in Riak? Doesn’t that need to always find unique keys?

One basic way to structure a manually-created index in Riak is to have a bucket specifically for the index. Keys in this bucket correspond to the exact value you are indexing (for fuzzy or incomplete values, use Riak Search). The objects stored at those keys have links or lists of keys that refer to the original object(s). Then you can find the original simply by following the link or using MapReduce to extract and find the related keys.

The example I gave in the webinar Q&A was indexing users by email. To create the index, I would use a bucket named users_by_email. If I wanted to look up my own user object by email, I’d try to fetch the object at users_by_email/sean@basho.com, then follow the link in it (something like </riak/users/237438-28374384-128>; riaktag="indexed") to find the actual data.

Whether those index values need to be unique is up to your application to design and enforce. For example, the index could be storing links to blog posts that have specific tags, in which case the index need not be unique.

To create the index, you’ll either have to perform multiple writes from your application (one for the data, one for the index), or add a commit hook to create and modify it for you.
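In Erlang terms, the two writes from your application might look like this minimal sketch (bucket and key names come from the example above; the client calls and the W-value of 2 are assumptions):

```erlang
%% Write the user object, then an index entry keyed by email whose
%% value is the user's primary key. Names and values are illustrative.
{ok, Client} = riak:local_client(),
User = riak_object:new(<<"users">>, <<"237438-28374384-128">>,
                       <<"{\"email\":\"sean@basho.com\"}">>),
ok = Client:put(User, 2),
Index = riak_object:new(<<"users_by_email">>, <<"sean@basho.com">>,
                        <<"237438-28374384-128">>),
ok = Client:put(Index, 2).
```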

Q: Can you compare/contrast buckets w/ Cassandra column families?

Cassandra has a very different data model from Riak, and you’ll want to consult with their experts to get a second opinion, but here’s what I know. Column families are a way to group related columns together that you will always want to retrieve together, and is something that you design up-front (it requires restarting the cluster for changes to take effect). It’s the closest thing to a relational table that Cassandra has.

In contrast, although you do use buckets to group similar data items, Riak’s buckets:

  1. Don’t understand or enforce any internal structure of the values,
  2. Don’t need to be created or designed ahead of time, but pop into existence when you first use them, and
  3. Don’t require a restart to be used.

Q: How would part sharing be achieved? (this is a reference to the example given in the webinar, Radiant CMS)

Radiant shares content parts only when specified by the template language, and always by inheritance from ancestor pages. So if the layout contained <r:content part="sidebar" inherit="true" />, then if the currently rendering page doesn’t have that content part, it will look up the hierarchy until it finds it. This is one example of why it’s so important to have an efficient way to traverse the site hierarchy, and why I presented so many options.

Q: What is the max number of links an object can have for Link Walking?

There’s no cut-and-dried answer for this. Theoretically, you are limited only by storage space (disk and RAM) and the ability to retrieve the object from the desired interface. In a practical sense this means that the default HTTP interface limits you to around 100,000 links on a single object (based on previous discussions of the limits of HTTP packets and header lengths). Still, this is not going to be reasonable to deal with in your application. In some applications we’ve seen links on the order of hundreds per object negatively impact link-walking performance. If you need to have that many, you’ll be better off exploring other designs.

Again, thanks for attending! Look for our next webinar coming in about a month.

Sean, Developer Advocate

Introducing Riak Function Contrib

December 2, 2010

A short while ago I made it known on the Riak Mailing list that the Basho Dev team was working on getting “more resources out the door to help the community better use Riak.” Today we are pleased to announce that Riak Function Contrib, one of these resources, is now live and awaiting your usage and code contributions.

What is Riak Function Contrib?

Riak Function Contrib is a home for MapReduce, Pre- and Post-Commit, and “Other” Functions or pieces of code that developers are using or might have a need for in their applications or testing. Put another way, it’s a Riak-specific function repository and library. Riak developers are using a lot of functions in a ton of different ways. So, we built Function Contrib to promote efficient development of Riak apps and to encourage a deeper level of community interaction through the use of this code.

How do I use it?

There are two primary ways to make use of Riak Function Contrib:

  1. Find a function – If, for instance, you needed a Pre-Commit Hook to validate a JSON document before you store it in Riak, you could use or adapt this to your needs. No need to write it yourself!
  2. Contribute a Function – There are a lot of use cases for Riak. This leads to many different functions and code bits that are written to extend Riak’s functionality. If you have one (or 20) functions that you think might be of use to someone other than you, contribute them. You’ll be helping developers everywhere. Francisco Treacy and Widescript, for example, did their part when they contributed this JavaScript reduce function for Sorting by Fields.

What’s Next?

Riak Function Contrib is far from being a complete library. That’s where you come in. If you have a function, script, or some other piece of code that you think may be beneficial to someone using Riak (or Riak Search for that matter), we want it. Head over to the Riak Function Contrib Repo on GitHub and check out the README for details.

In the meantime, the Basho Dev team will continue polishing up the site and GitHub repo to make it easier to use and contribute to.

We are excited about this. We hope you are, too. As usual, thanks for being a part of Riak.

More to come…

Mark

Community Manager

Free Webinar – Schema Design for Riak – Dec 7 at 2PM Eastern

December 1, 2010

Moving applications to Riak involves a number of changes from the status quo of RDBMS systems, one of which is taking greater control over your schema design. You’ll have questions like: How do you structure data when you don’t have tables and foreign keys? When should you denormalize, add links, or create MapReduce queries? Where will Riak be a natural fit and where will it be challenging?

We invite you to join us for a free webinar on Tuesday, December 7 at 2:00PM Eastern Time to talk about Schema Design for Riak. We’ll discuss:

  • Freeing yourself of the architectural constraints of the “relational” mindset
  • Gaining a fuller understanding of your existing schema and its queries
  • Strategies and patterns for structuring your data in Riak
  • Tradeoffs of various solutions

We’ll address the above topics and more as we design a new Riak-powered schema for a web application currently powered by MySQL. The presentation will last 30 to 45 minutes, with time for questions at the end.

If you missed the previous version of this webinar in July, here’s your chance to see it! We’ll also use a different example this time, so even if you attended last time, you’ll probably learn something new.

Fill in the form below if you want to get started building applications on top of Riak!

Sorry, registration is closed! Video of the presentation will be posted on Vimeo after the webinar has ended.

The Basho Team

Where To Find Basho This Week

October 26, 2010

Basho is hosting one event this week and participating in another. Here are the details to make sure everyone is up to speed:

A NOSQL Evening in Palo Alto

Tonight there will be a special edition of the Silicon Valley NoSQL Meetup, billed as “A NOSQL Evening in Palo Alto.” Why do I say “special”? Because this month’s event has been organized by the one and only Tim Anglade as part of his NoSQL World Tour. And this is shaping up to be one of the tour’s banner events.

Various members of the Basho Team will be in attendance and Andy Gross, our VP of Engineering, will be representing Riak on the star-studded panel.

There are almost 200 people signed up to see this discussion, and it’s sure to be action-packed and informative. If you’re in the area and can make it out on short notice, I would recommend you attend.

October San Francisco Riak Meetup

On Thursday night, from 7-9, we are holding the October installment of the San Francisco Riak Meetup. Like last month, the awesome team at Engine Yard has once again been gracious enough to offer us their space for the event.

We have two great planned talks for this month. The first will be Basho hacker Kevin Smith talking about a feature of Riak that he has had a major hand in writing: MapReduce. Kevin is planning to cover everything from design to new code demos to the road map. In short, this should be exceptional.

For the second half of Thursday’s meetup we are going to get more interactive than usual. Articulation of use cases and database applicability is still something largely unaddressed in our space. So we thought we would address it. We are inviting people to submit use cases in advance of the meetup with some specific information about their apps. The Basho Developers are going to do some work before the event analyzing the use cases and then, with some help from the crowd, determine if and how Riak will work for a given use case – and if Riak isn’t the right fit, we might even help you find one that is. If you are curious whether or not Riak is the right database for that Facebook-killer you’re planning to build, now is your chance to find out. We still have room for one or two more use cases, so even if you’re not going to be able to attend Thursday’s meetup I want to hear from you. Follow the instructions on the meetup page linked above to submit a use case.

That said, if you are in the Bay Area on Thursday night and want to have some beer and pizza with a few developers who are passionate about Riak and distributed systems, RSVP for the event. You won’t be disappointed.

Hope to see you there!

Mark

Free Webinar – Riak with Rails – August 5 at 2PM Eastern

July 29, 2010

Ruby on Rails is a powerful web framework that focuses on developer productivity. Riak is a friendly key value store that is simple, flexible and scalable. Put them together and you have lots of exciting possibilities!

We invite you to join us for a free webinar on Thursday, August 5 at 2:00PM Eastern Time (UTC-4) to talk about Riak with Rails. In this hands-on webinar, we’ll discuss:

  • Setting up a new Rails 3 project for Riak
  • Storing, retrieving, manipulating key-value data from Ruby
  • Issuing map-reduce queries
  • Creating rich document models with Ripple
  • Using Riak as a distributed cache and session store

The presentation will last 30 to 45 minutes, with time for questions at the end. Fill in the
form below if you want to get started building Rails applications on top of Riak!

Sorry, registration is closed.

The Basho Team

Free Webinar – MapReduce Querying in Riak – July 22 at 2PM

July 15, 2010

Map-Reduce is a flexible and powerful alternative to declarative query languages like SQL that takes advantage of Riak’s distributed architecture. However, it requires a whole new way of thinking about how to collect, process, and report your data, and is tightly coupled to how your data is stored in Riak.

We invite you to join us for a free webinar on Thursday, July 22 at 2:00PM Eastern Time (UTC-4) to talk about Map-Reduce Querying in Riak. We’ll discuss:

  • How Riak’s Map-Reduce differs from other systems and query languages
  • How to construct and submit Map-Reduce queries
  • Filtering, extracting, transforming, aggregating, and sorting data
  • Understanding the efficiency of various types of queries
  • Building and deploying reusable Map-Reduce function libraries

We’ll cover the above topics in conjunction with practical examples from sample applications. The presentation will last 30 to 45 minutes, with time for questions at the end.

Fill in the form below if you want to get started building applications with Map/Reduce on top of Riak!

Sorry, registration has closed!

The Basho Team