Tag Archives: Riak Search

Riak 0.13 Released

October 10, 2010

Dedicated Riak users who were hanging out in the IRC channel late Friday may have seen the following message from our bot that reports changes to the canonical Riak repo on Bitbucket:

dizzyd Added tag riak-0.13.0 for changeset 957b3d601bde

That’s right! We tagged the 0.13 release of Riak this past Friday and, after a weekend away from our laptops, are ready to announce it officially. As you’re about to learn, this was a banner release for a whole host of reasons. Here is a rundown of the noteworthy changes and enhancements in this release:

Riak Search

As most people know, we have been hard at work on Riak Search for several months and, after releasing it to a handful of users as part of a limited beta, we are finally ready to release it into the wild!

Riak Search is a distributed, easily-scalable, failure-tolerant, real-time, full-text search engine built around Riak Core and tightly integrated with Riak’s key/value layer.

At a very high level, Search works like this: when a bucket in Riak has been enabled for Search integration (by installing the Search pre-commit hook), any objects stored in that bucket are also indexed seamlessly in Riak Search. You can then find and retrieve your Riak objects by their contents: the Riak Client API can be used to perform Search queries that return a list of bucket/key pairs matching the query, or the query results can be used as the input to a Riak MapReduce operation. Currently, the PHP, Python, Ruby, and Erlang APIs support integration with Riak Search.
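
As a rough sketch, here is what that flow can look like from the Python client. (Method names follow the Python API as it existed around this release and may differ in your version; the bucket, key, and field names are placeholders, and the bucket is assumed to already have the Search pre-commit hook installed.)

    import riak

    client = riak.RiakClient(host='127.0.0.1', port=8098)

    # Storing into a Search-enabled bucket also indexes the object,
    # via the pre-commit hook.
    bucket = client.bucket('books')
    bucket.new('moby-dick', data={'title': 'Moby Dick'}).store()

    # search() builds a MapReduce job whose inputs are the bucket/key
    # pairs matching the query; run() executes it.
    query = client.search('books', 'title:moby')
    pairs = query.run()

    # The same kind of query can feed further MapReduce phases directly:
    keys = client.search('books', 'title:moby').map(
        'function(v) { return [v.key]; }').run()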

There is obviously much more to be written about Riak Search, and a simple blurb in this blog post won’t do it justice. For that reason, we are dedicating an entire blog post to it, which you can expect to see next week. In the meantime, you can go download the code and find more usage information in the Riak Search section.

riak_kv and riak_core

In addition to focusing on previously-unreleased bits like Search, we also strengthened the code and functionality of riak_kv, the layer that handles the key/value operations, and riak_core, the component of Riak that provides all the services necessary to write a modern, well-behaved distributed application.

Firstly, the structure of the Riak source repos has been changed such that there is a single Erlang app per repo. This permits third parties to use riak_core (the abstracted Dynamo logic) in other applications. As you can imagine, this opens up a world of possibilities and we are looking forward to seeing where this code ends up being implemented. (Check out this blog post for a primer on Riak Core.)

The consistent hash ring balancing logic has been improved to enable large clusters to converge more quickly on a common ring state. Related to this, further work has been done to improve Riak’s performance and robustness when dealing with large data sets and failure scenarios.

Also, the Riak Command Line Tools were expanded with two new commands: “ringready” and “wait-for-service.” These help anyone administering a Riak installation to script detection of cluster convergence on a given ring and to wait for components like riak_kv to become available, respectively. You can read up on all your favorite Riak Command Line Tools here.
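
For example, a deployment script might do something like this (assuming the commands are run through the riak-admin wrapper, and substituting your own node name):

    riak-admin wait-for-service riak_kv riak@127.0.0.1
    riak-admin ringready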

MapReduce

Enhancing Riak’s MapReduce functionality is something on which we’ve been focusing intensely for the past several releases, and this one was no different.

We reworked how Riak’s MapReduce handled overload scenarios, especially in the event that the number of available JavaScript VMs was low. As a result of these changes, Riak now does a much better job of tracking which JavaScript VMs are busy and attempts to spread the load more evenly over all VMs.

In addition to this, the caching layer for JavaScript MapReduce has been completely re-implemented. This results in performance gains when repeating the same MapReduce jobs. Specifically, this work includes a new in-memory vnode LRU cache solely for map operations. The size of the cache is now configurable (via the ‘vnode_cache_entries’ entry in the riak_kv section of app.config) and defaults to 1000 objects.
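
In app.config, that setting lives in the riak_kv section and looks something like the following (only the relevant entry is shown, set to its default):

    {riak_kv, [
        %% Maximum number of objects held in each vnode's map-operation LRU cache
        {vnode_cache_entries, 1000}
    ]}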

And, we improved MapReduce job input handling. In short, this means you will see lower resource consumption and more stable run times when the input producer exceeds the cluster’s ability to process new MapReduce inputs.

It’s also worth mentioning that there is more MapReduce goodness in the pipeline, both code and non-code. Stay tuned, because it’s only getting better!

Bitcask

Bitcask, the default storage backend for Riak, also got some huge enhancements in the 0.13 release. The most noticeable and significant improvement is that Bitcask now uses 40% less memory per key entry. The net effect is that your cluster requires less RAM (as Bitcask stores all key data in memory). And, for those of you working with large data sets, we did some work to make starting up Bitcask significantly faster.

Listing keys, something the community has been asking us to make more efficient, got a nice speedup when using Bitcask (which should make some of your map/reduce queries a lot snappier). This work is ongoing, too, so look for this process to get even faster as we make additional incremental improvements. Also of note, Bitcask will now reclaim the memory of expired key entries.

Conclusion

Riak 0.13 is our best release yet and we are more confident than ever that it’s the best database out there for your production needs (assuming it suits your use case, of course). In addition to the changes we highlighted above, you should also take a moment to check out the complete release notes.

After you’re done with those, drop everything and go download Riak 0.13.

As always, let us know if you have any questions or issues, and thanks for being a part of Riak!

The Basho Team

Basho is Taking Over Baltimore This Weekend

September 29, 2010

Basho hackers will be giving quite a few presentations between now and the end of the week. And they all happen to be in Baltimore! Here is a quick rundown (in no particular order) of where we will be, who will be there, and what we will be talking about:

Rusty Klophaus at CUFP

Rusty Klophaus will be at the Commercial Users of Functional Programming (CUFP) Event taking place this weekend in Baltimore, Maryland. His talk is called “Riak Core: Building Distributed Applications Without Shared State” and it should be downright amazing.

From the talk’s description: Both Riak KV (a key-value datastore and map/reduce platform) and Riak Search (a Solr-compatible full-text search and indexing engine) are built around a library called Riak Core that manages the mechanics of running a distributed application in a cluster without requiring a central coordinator or shared state. Using Riak Core, these applications can scale to hundreds of servers, handle enterprise-sized amounts of data, and remain operational in the face of server failure.

All the details on his talk can be found here.

And, in case you haven’t been following your Riak Core developments, check out Building Distributed Systems with Riak Core.

Dave Smith at the Ninth ACM SIGPLAN Erlang Workshop

Dave Smith (a.k.a “Dizzyd”) will be keynoting the ACM SIGPLAN Erlang Workshop taking place on September 30th, also in Baltimore. Dave’s talk is called “Rebar, Bitcask and how chemotherapy made me a better developer.”

Rebar and Bitcask are both pieces of software that Dizzy had a major hand in creating and they have played a huge role in Riak’s adoption (not to mention that Rebar has quickly become an indispensable tool for Erlang developers everywhere). Dave was also fighting follicular lymphoma while writing a lot of this code. Needless to say, this one is sure to be memorable and of immense value.

More details on his talk can be found here.

It should also be noted that newly-minted Basho Developer Scott Fritchie is the Workshop Chair for this event. He is an accomplished Erlang developer and Riak is not the only distributed key/value store about which Scott is passionate – he will also happily talk your ear off about Hibari.

Justin Sheehy at Surge

Just when you thought they couldn’t fit another conference in Baltimore on the same weekend… And this one is big: it’s the Surge Conference put on by the team at OmniTI. Basho will be there in the form of CTO Justin Sheehy.

Justin will be giving a talk about concurrency at scale, something about which every distributed systems developer should care deeply. Additionally, he will be taking part in a panel discussion – I haven’t seen an official name for it yet but rumor has it that it’s something along the lines of “SQL versus NoSQL.”

Check out the Surge site for more conference details.

As you can see, Baltimore is the place to be this weekend. Get there at all costs. And then go download Riak.

Mark

Webinar Recap – Riak in Action with Wriaki

August 20, 2010

Thank you to those who attended our webinar yesterday. Like before, we’re recapping the questions below for everyone’s sake (in no particular order).

Q: How would you solve full-text search with the current versions of Riak? One could take Wriaki as an example, as most wikis have some sort of full-text search functionality.

I recommend using existing full-text solutions. Solr has matched up well with most of the web applications I have written and would certainly work for Wriaki as well.

Q: Where in the course of the interaction (shown on slide 18) are you defining the client ID? Don’t you need the client ID and vclock to match between updates?

On slide 42, we talk about “actors,” which are essentially client IDs. Using the logged-in user as the client ID can help prevent vclock explosion and is a sensible way of structuring your updates.
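
For illustration, here is a minimal sketch with the Python client, assuming the client-ID setter available in the client library of this era; ‘bob’ stands in for whichever user is logged in:

    import riak

    # Use the logged-in user as the actor for this update, so repeated
    # edits by the same user reuse one vclock entry rather than growing
    # the vector clock on every request.
    client = riak.RiakClient(host='127.0.0.1', port=8098)
    client.set_client_id('bob')

    bucket = client.bucket('wiki')
    page = bucket.get('HomePage')
    page.set_data({'text': 'updated article text'})
    page.store()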

Bryan

Riak Search

May 21, 2010

This post is going to start by explaining how in-the-trenches experience with key/value stores, like Riak, led to the creation of Riak Search. Then it will tell you why you care, what you’ll get out of Riak Search, and why it’s worth waiting for.

A bit of history

Few people know that Basho used to develop applications for deployment on Salesforce.com. We had big goals, and were thinking big to fill them; part of that was choosing a data storage system that would give us what we needed not only to succeed and grow, but to survive. It was a confluence of pragmatism and idealism: a bulletproof operations story and a path upward, with resilience, reliability, and scalability built on proven science.

So, that’s what we did: we developed and used what has grown to be, and what you know today, as Riak.

Idealism can’t get you everywhere, though. While we answered hard questions with link-walking and map/reduce, there was still a desire in the backs of our heads: sometimes you just want to ask, “What emails were sent on May 21 that included the word ‘strategy’?” without having to figure out how to walk links from an organizational chart to mailboxes to mails, and then filter over the data there. It was a pragmatic desire: we just wanted a quick answer in order to decide whether or not to spend more time chasing a path. “Less yak-shaving, please.”

The Operations Story

Then we stopped making Salesforce.com apps, and started selling Riak. We quickly found the same set of desires. Operationally, Riak is a huge win. Pragmatically, something that does indexing and search in a similar operational manner is even bigger. Thus, Riak Search was born.

The operational story is, in a nutshell, this: when you add another node to your cluster, you add capacity and compute power. That’s it, you just add another box and “it just works.” Purposefully or not, eventually a node leaves the cluster, hardware fails, whatever: Riak deals with it. If the node comes back, it’s absorbed like it never left.

We insisted on these qualities for Riak, and have continued that insistence in Riak Search. We did it with all the familiar bits: consistent hashing, hinted handoff, replication, etc.

Why Riak Search?

Now, we’ll be the first to tell you that with Riak you can get pretty far using link-walking and map/reduce, with the understanding that you know what you are going to want ahead of time, and/or are willing to wait for it.

Riak Search answers the questions that pop into your head: “find me all the blue dresses that are between $20 and $30,” “find me the document Bob referred to last week at the TPS procedures meeting,” “how can I delete all these emails from my aunt that have those stupid attachments?” “find me that comic strip with Bob,” etc.
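
The first of those questions, for instance, maps directly onto the Lucene query syntax that Riak Search accepts (the field names here are hypothetical, and the square brackets are Lucene’s inclusive range notation):

    category:dress AND color:blue AND price:[20 TO 30]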

It’s about making the sea of data in your key/value store findable. At a higher level, it’s about agility: the ability to answer questions about your business and your customers without consulting a developer or digging through reference manuals, and without your application developers reinventing the wheel (with the very real possibility of getting it just close enough to right to convince you nothing will go wrong). It’s about a common indexing language.

Okay, now you know — thanks for bearing with us — let’s get to the technical bits.

Riak Search …

The system we have built …

  1. is an easily-scalable, fault-tolerant search and indexing system, adhering to the operational story you just read
  2. supports full-text indexing and search
  3. allows querying via the Lucene query syntax
  4. has Solr-compatible /select and /update web-services (see the sketch after this list)
  5. supports date and numeric indexing
  6. supports faceting
  7. automatically distributes indexes
  8. has an intermediate query language and integrated query planner
  9. supports scoring
  10. has integrated tokenizing, filtering and analysis (yes, you can use StandardAnalyzer!)

… and much more. Sounds pretty great, right?
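
To give a flavor of the Solr-compatible interface mentioned in item 4, a query is just an HTTP GET against a /select resource. A minimal sketch in Python follows; the port and URL layout are what we’d expect for a stock local node, and the index and field names are made up:

    import json
    import urllib.parse
    import urllib.request

    # Query the Solr-compatible /select resource of a local Riak Search node.
    params = urllib.parse.urlencode({'q': 'title:riak', 'wt': 'json'})
    url = 'http://127.0.0.1:8098/solr/books/select?' + params

    with urllib.request.urlopen(url) as resp:
        print(json.load(resp))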

If you want to know more about the internals and technical nitty-gritty, check out the presentation that Riak Search engineer John Muellerleile, one of our own, gave at the San Francisco Erlang Factory this year.

So, why don’t you have it yet? The easy part.

There are still some stubs and hard-coded things in what we have. For instance, the full-text analyzer in use is just whitespace, case-normalization, and stop-word filtering. We intend to fully support the ability to specify other Lucene analyzers, including custom modules, but the code isn’t there yet.

There is also very little documentation. Without a little bit of handholding, even the brightest and most ambitious user could be forgiven for staring blankly, lost for even the first question to ask. We’re spreading the knowledge among our own team right now; that process will generate the artifacts needed for the next round of users to step in.

There are also many fiddly, finicky bits. These are largely relics of early iterations. Rather than having the interwebs be flooded with, “How do you stop this thing?” (as it was with Riak), we’re going to make things friendlier.

So, why don’t you have it yet? The not-so-easy part.

You’ve probably asked yourself, “What of integration of Riak and Riak Search?” We have many notes from discussions about how it could or should be done, as well as code showing how it can be done. But, we’re not completely satisfied with any of our implementations so far.

There is certainly no shortage of designs and ideas on how this could or should work, so we’re going to make a final pass at refining all of them, with our current Riak Search system to play with, so that we can provide a solid, extensible system instead of one with many rough edges that would almost certainly be replaced immediately.

Furthering this sentiment, we think that our existing map/reduce framework and the functionality and features provided by Riak Search are a true power combo when used together intelligently, rather than simply as alternatives or, at worst, at odds. As a result, we’re defining exactly how Riak Search indexing and querying should be threaded into Riak map/reduce processing to bring you a combination that is undoubtedly more than the sum of its parts.

We could tease you with specifics, like generating the set of bucket/key inputs to a map phase by performing a Riak Search query, or parameterizing Search phases with map results. For now, though, amidst protest both internal (we’re champing at the bit to get this out into the world and into your hands) and external (our favorite people continually request this exact set of technology and features), we’re going to implement the few extra details from our refined notes before forcing it on you all.

Hold on just a little longer. :)

-the Riak Search Team

Collecta Chooses Riak Search

December 15, 2009

Another big announcement for the team here at Basho: Collecta, which makes a truly cool real-time streaming search engine, has chosen to use Riak Search. They are longtime Webmachine users and when they learned about Riak, they partnered with us to define Riak Search and validate the prototype.

Look for a blog post later in the day from Justin Sheehy on what it was like to work with Collecta. (Hint: it was awesome!)

Stay tuned.

Mark

Basho Podcast Number 1 – Justin Sheehy and Tony Falco on Scaling out with Riak and Riak Search

December 11, 2009

Just out: Basho’s first podcast discussing Riak. Justin Sheehy and Tony Falco revisit the definition of scalability Justin first discussed at NoSQL East 2009 and discuss EC2, Riak, Riak’s map/reduce, and the soon-to-be-released distributed search and indexing. As a special bonus, at 3:24 in the podcast, listen for the sound of Kevin Smith’s SMS accepting the job at Basho. The mic did not pick up Justin’s grimace. Of course, he didn’t miss a beat. “I just did, Bob….”

Enjoy,

Mark Phillips


