Tag Archives: Riak Search

Riak Community Release Notes v0.3

June 4, 2012

I’m thrilled to announce that the v0.3 Riak Community Release Notes are now official. (For some history on the Community Release Notes, go here.) This installment covers what happened in the community from (approximately) May 4 through June 1. Some of the many stand-out accomplishments from this release:

We did a lot more. Take a few minutes to read up. Also, we’re already rolling with the 0.4 Release Notes (which will cover June 2 up through July 1). You’re encouraged to contribute to past, present, and future release notes, so don’t hold back.

Enjoy, and thanks for being a part of Riak.

Mark

Follow Up To Riak And Python Webinar

August 3, 2011

Thanks to everyone who attended yesterday’s webinar on Riak and Python. If you missed the webinar, we’ve got you covered. Find a screencast of the webinar below, or check it out directly on Vimeo. Unfortunately it’s missing the questions I answered at the end, so I’ve recapped them in written form below to make up for it.

We’ve made the slides available for your viewing pleasure as well. But most importantly, have a look at our Python library for Riak, which has gotten a lot of feature love lately, and at the source code for Riagi, the sample Django application. It uses Riak for session and file storage, and Riak Search for storing user and image metadata.

Thanks to dotCloud for providing the hosting for Riagi. A little birdie told us they’re working on supporting Riak as part of their official stack soon.

The Python client for Riak wouldn’t exist without the community, so we want to say thank you for your contributions, and we’re always ready for more pull requests!

Keep an eye out for a new release of the Python client this week, including several of the new features shown in the webinar!

The two questions asked at the end were:

  • Is there a Scala client for Riak? Links relevant to the answer in the video: Outdated Scala Library, a more recent fork, and our official Java client.
  • Is Protocol Buffers more efficient than HTTP? The answer is detailed in the webinar video.

Boosting Riak Search Query Performance With Inline Fields

July 18, 2011

(This was originally posted on Ryan Zezeski’s working blog, “Try Try Try.”)

In this post I want to give a quick overview of inline fields, a recent addition to Riak Search that allows you to trade off disk space for a considerable performance bump in query execution and throughput. I’m going to assume the reader is already familiar with Search. In the future I may do a Search overview; if you would like that, ping me on Twitter.

The Goal

Recently on the Riak Users Mailing List there was a discussion about improving the performance of Search when executing intersection (i.e. AND) queries where one term has a low frequency and the other has a high frequency. This can pose a problem because Search needs to run through all the results on both sides in order to provide the correct result. Therefore, the query is always bounded by the highest-frequency term. This is exacerbated further by the fact that Search uses a global index, or in other words partitions the index by term. This effectively means that all results for a particular term are pulled sequentially from one node. This is opposed to a local index, or partitioning by document, which effectively allows you to parallelize the query across all nodes. There are trade-offs for either method and I don’t want to discuss them in this blog post. However, it’s good to keep in mind [1]. My goal with this post is to show how you can improve the performance of this type of query with the current version of Search [2].
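To make the cost concrete, here’s a minimal JavaScript sketch (not Search’s actual implementation) of intersecting a low-frequency posting list with a high-frequency one. No matter how small the intersection is, every posting for the frequent term still has to be pulled and examined:

```javascript
// Toy intersection of two posting lists. The shape of a posting
// ({ docId: ... }) is illustrative only.
function intersect(lowFreqPostings, highFreqPostings) {
  var wanted = {};
  lowFreqPostings.forEach(function(posting) {
    wanted[posting.docId] = true;
  });
  // The full high-frequency list is walked even if only a handful
  // of its entries survive the intersection.
  return highFreqPostings.filter(function(posting) {
    return wanted[posting.docId] === true;
  });
}
```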

What’s an “Inline” Field, Anyways?

To properly understand inline fields you need to understand the inverted index data structure [3]. As a quick refresher, the gist is that the index is a map from words to a list of document reference/weight pairs. For each word [4] the index tells you in which documents it occurs and its “weight” in relation to that document, e.g. how many times it occurs. Search adds a little twist to this data structure by allowing an arbitrary list of properties to be tacked onto each of these pairs. For example, Search tracks the position of each occurrence of a term in a document.
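As a hypothetical illustration (the field and property names below are made up, not Search’s internal format), an inverted index with Search-style properties might look like this in JavaScript:

```javascript
// Toy inverted index: each term maps to a list of postings, and each
// posting carries extra properties such as term positions.
var invertedIndex = {
  "earthquake": [
    { docId: "tweet/1", weight: 2, props: { positions: [0, 7] } },
    { docId: "tweet/9", weight: 1, props: { positions: [3] } }
  ],
  "haiti": [
    { docId: "tweet/1", weight: 1, props: { positions: [1] } }
  ]
};

// The index answers: which documents contain this term, and how
// heavily does it weigh in each of them?
invertedIndex["earthquake"].map(function(p) { return p.docId; });
// => ["tweet/1", "tweet/9"]
```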

Inline fields allow you to take advantage [5] of this fact and store the terms of a field directly in the inverted index entries [6]. Going back to my hypothetical query, you could mark the field with the frequently occurring term as inline and change the AND query to a query and a filter. A filter is simply an extra argument to the Search API that uses the same syntax as a regular query but makes use of the inline field. This has the potential to drop your latency dramatically, as you avoid pulling the massive posting list [7] altogether.
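Conceptually (again a toy sketch, with a made-up `inline` property rather than Search’s real posting format), the filter step works like this: the cheap low-frequency query produces a small posting list, and each posting already carries the inline field’s terms, so the frequent term’s huge posting list is never touched:

```javascript
// Filter a small posting list by a term stored inline on each posting.
function applyInlineFilter(postings, inlineField, term) {
  return postings.filter(function(posting) {
    // posting.inline is hypothetical: the inline field's terms stored
    // directly on the posting, as described above.
    return posting.inline[inlineField].indexOf(term) !== -1;
  });
}

// e.g. postings from the created_at range query, filtered in place:
// applyInlineFilter(rangePostings, "text", "earthquake");
```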

WARNING: Inline fields are not free! Think carefully about what I just described and you’ll realize that this list of inline terms will be added to every single posting for that index. If your field contains many terms or you have many inline fields this could become costly in terms of disk space. As always, benchmarking with real hardware on a real production data set is recommended.

The Corpus

I’ll be using a set of ~63K tweets that occurred in reaction to the devastating earthquake that took place in Haiti in January 2010. The reason I chose this data set is that it’s guaranteed to have frequently occurring terms such as “earthquake” but also low-frequency terms [7] such as the time the tweets were created.

The Rig

All benchmarks were run on a 2GHz i7 MBP with an SSD [8]. An initial run is performed to prime all systems. Essentially, everything should be coming from FS cache, meaning I’ll mostly be testing processing time. My guess is disk I/O would only amplify the results. I’ll be using Basho Bench and running it on the same machine as my cluster. My cluster consists of four Riak nodes (obviously, on the same machine) which I built from master [9].

If you’d like to run the benchmarks on your own hardware please see the RUN_BENCHMARKS.md file.

Naive Query

"text:earthquake"

The naive query asks for every document ID [10] that includes the word “earthquake”. This should return 62,805 results every time.

[Benchmark graph: Naive query]

Scoped Query

"text:earthquake AND created_at:[20100113T032200 TO 20100113T032500]"

The scoped query still searches for all documents with the term “earthquake” but restricts this set further to only those that were created in the provided three-minute time span.

[Benchmark graph: Scoped query]

Scoped Query With Filtering

"created_at:[20100113T032200 TO 20100113T032500]" "text:earthquake"

This is the same as the scoped query except “earthquake” is now a filter, not a query. Notice that, unlike the previous two queries, there are two strings: the first is the query, the second is the filter. You could read that in English as:

Execute the query to find all tweets created in this three minute range. Then filter that set using the inline field “text” where it contains the term “earthquake.”
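For reference, here is a rough sketch of issuing that query-plus-filter pair from Node.js against Riak Search’s Solr-compatible HTTP interface. The index name (`tweets`), the port, and the exact parameter names are assumptions based on Search’s documented interface at the time; check the docs for your version:

```javascript
// Hypothetical: query on the created_at range, filter on the inline
// "text" field via the Solr-compatible endpoint.
var http = require('http');

var q      = encodeURIComponent('created_at:[20100113T032200 TO 20100113T032500]');
var filter = encodeURIComponent('text:earthquake');
var path   = '/solr/tweets/select?wt=json&q=' + q + '&filter=' + filter;

http.get({ host: '127.0.0.1', port: 8098, path: path }, function(res) {
  var body = '';
  res.on('data', function(chunk) { body += chunk; });
  res.on('end', function() { console.log(body); });
});
```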

[Benchmark graph: Scoped query with filter]

Wait One Second!

Just before I was about to consider this post wrapped up, I realized my comparison of inline vs. non-inline wasn’t quite fair. As currently implemented, when returning postings the inline field’s value is included. I’m not sure if this is of any practical use outside the filtering mechanism, but it means that in the case of the naive and scoped queries the cluster is taking an additional disk and network hit by carrying all that extra baggage. A fairer comparison would be to run the naive and scoped queries with no inline fields. I adjusted my scripts and did just that.

Naive With No Inlining

[Benchmark graph: Naive query, no inlining]

Scoped With No Inlining

[Benchmark graph: Scoped query, no inlining]

Conclusions

In this first table I summarize the absolute values for throughput, 99.9th percentile and average latencies.

| Stat | Naive (I) | Naive | Scoped (I) | Scoped | Scoped Filter |
| --- | --- | --- | --- | --- | --- |
| Thru (op/s) | 2.5 | 3.5 | 3 | 5 | 15 |
| 99.9% (ms) | 875 | 490 | 575 | 350 | 42 |
| Avg (ms) | 800 | 440 | 530 | 310 | 25 |

In this benchmark I don’t care so much about the absolute numbers as I do how they relate to each other. In the following table I show the performance increase of using the scoped filter query versus the other queries. For example, the scoped filter query has three times the throughput and returns in 1/12th of the time, on average, as compared to the scoped query. That is, even its closest competitor has a latency profile that is an order of magnitude worse. You may find it odd that I included the naive queries in this comparison but I wanted to show just how great the difference can be when you don’t limit your result set. Making a similar table comparing naive vs. scoped might be useful as well but I leave it as an exercise to the reader.

| Stat | Naive (I) | Naive | Scoped (I) | Scoped |
| --- | --- | --- | --- | --- |
| Thru | 6x | 4x | 5x | 3x |
| 99.9% | 20x | 11x | 13x | 8x |
| Avg | 32x | 17x | 21x | 12x |

In conclusion, I’ve done a drive-by benchmark showing that there are potentially great gains to be had by making use of inline fields. I say “potentially” because inline fields are not free, and you should take the time to understand your data set and analyze what trade-offs you might be making by using this feature. In my example I’m inlining the text field of a Twitter stream, so it would be useful to gather some statistics, such as the average number of terms per tweet and the average size of each term. Armed with that info you might then determine how many tweets you plan to store, how many results a typical query will match, and how much extra I/O overhead that inline field is going to add. Finally, run your own benchmarks on your own hardware with real data while profiling your system’s I/O, CPU, and memory usage. Doing anything else is just pissing in the wind.

Ryan

References

[1]: If you’d like to know more you could start by reading Distributed Query Processing Using Partitioned Inverted Files.

[2]: Inline fields were added in 0.14.2, but my benchmarks were run against master.

[3]: I like the introduction in Effect of Inverted Index Partitioning Schemes on Performance of Query Processing in Parallel Text Retrieval Systems.

[4]: In search parlance a word is called a term and the entire list of terms is called the vocabulary.

[5]: Or abuse, depending on your disposition.

[6]: Entries in an inverted index are also called postings by some people.

[7]: Or high cardinality, depending on how you want to look at it.

[8]: Just like when dynoing a car, it’s constant conditions and relative improvement that matter. Once you’re out of the shop those absolute numbers don’t mean much.

[9]: The exact commit is 3cd22741bed9b198dc52e4ddda43579266a85017.

[10]: BTW, in this case “document” is a Riak object indexed by the Search pre-commit hook.

Riak Search Explained

May 9, 2011

At last month’s San Francisco Riak Meetup, Basho Developer Advocate Dan Reverri gave an extensive and action-packed presentation on Riak Search. The talk goes from a basic overview of what Riak Search is and does up through running a sample Riak Search application. If you’re at all interested in what Riak Search can do for your applications, this video is well worth your 35 minutes. You can also get a PDF copy of Dan’s slides here.

When you’re done with the video:

NOTE: In the presentation Dan states that Riak Search’s standard analyzer supports stemming. This is not entirely true. The standard analyzer currently does not support stemming, but Riak Search allows you to call out to Java Lucene analyzers, which *do* support stemming. Sorry for any confusion.

Enjoy.

Mark

Riak Search Explained from Basho Technologies on Vimeo.

Why MapReduce is Easy

March 30, 2011

There’s something about MapReduce that makes it seem rather scary. It almost has this Big Data aura surrounding it, making it seem like it should only be used to analyze a large amount of data in a distributed fashion. It’s one of the pieces that makes Riak a pretty versatile key-value store: feed a bunch of keys into it and do some analytics on the objects. Quite handy.

But when you narrow it down to just the basics, MapReduce is pretty simple. I’m almost 100% certain that you’ve used it in one way or another in an application you’ve written. So before we go all distributed, let’s break MapReduce down into something small that you can use every day. That certainly has helped me understand it much better.

For our webinar on Riak and Node.js we built a little application with Node.js and Riak Search to store and search syslog messages. It’s called Riaktant and handily converts and stores syslog messages in a way that’s friendlier for both Riak Search and MapReduce. We’ll base this on examples we used in building the application.

MapReduce is easy because it works on simple data

MapReduce loves simple data structures. Why? Because when there are no deep, nested relationships between, say, objects, distributing data for parallel processing is a breeze. But I’m getting a little ahead of myself.

Let’s take the data Riaktant stores in Riak and see how easy it is to sift through it without even having to go distributed. It uses a JavaScript library called glossy to parse a syslog message and turn it into this nice JSON data structure.

```javascript
message = {
  "originalMessage": "<35>1 2011-02-14T11:10:25.137+01:00 lb1.basho.com ftpd 7003 - Client disconnected",
  "time": "2011-02-14T10:10:25.137Z",
  "severityID": 3,
  "facility": "auth",
  "version": 1,
  "prival": 35,
  "host": "lb1.basho.com",
  "facilityID": 4,
  "message": "7003 - Client disconnected",
  "severity": "err"
}
```

MapReduce is easy because you use it every day

I’m almost 100% certain you use MapReduce every day. If not daily, then at least once a week. Whenever you have a list of items that you loop or iterate over and transform into something else one by one, if only to extract a single attribute, there’s your map function.

Keeping with JavaScript, here’s how you’d extract the host from the above JSON, for a whole list:

```javascript
messages = [message];

messages.map(function(message) {
  return message.host
})
```

Or, if you insist, here’s the Ruby equivalent:

```ruby
messages.map do |message|
  message[:host]
end
```

If you must ask, here’s Python, using a list comprehension, for added functional programming sugar:

```python
[message['host'] for message in messages]
```

There, so simple, right? Halfway there to some full-fledged MapReduce action.

MapReduce is easy because it’s just code

Before we continue, let’s add another syslog message.

```javascript
message2 = {
  "originalMessage": "<35>1 2011-02-14T11:10:25.137+01:00 web2.basho.com ftpd 7003 - Client disconnected",
  "time": "2011-02-14T10:12:37.137Z",
  "severityID": 3,
  "facility": "http",
  "version": 1,
  "prival": 35,
  "host": "web2.basho.com",
  "facilityID": 4,
  "message": "7003 - Client disconnected",
  "severity": "warn"
}
messages.push(message2)
```

We can take the above example even further (still using JavaScript), and perform some additional operations like result sorting, for example.

```javascript
messages.map(function(message) {
  return message.host
}).sort()
```

This gives us a nice sorted list of hosts. Coincidentally, sorting happens to be the second step in traditional MapReduce. Isn’t it nice how easily this is coming together?

The third and last step involves, you guessed it, more code. I don’t know about you, but I love things that involve code. Let’s reduce the list of hosts and count the occurrences of each host (and if this reminds you of an SQL query that involves GROUP BY, you’re right on track).

```javascript
var reduce = function(total, host) {
  if (host in total) {
    total[host] += 1
  } else {
    total[host] = 1
  }
  return total
}

messages.map(function(message) {
  return message.host
}).sort().reduce(reduce, {})
```

There’s one tiny bit missing for this to be as close to MapReduce as we can get without going distributed. We need to slice up the list before we hand it to the map function. As JavaScript doesn’t have a built-in function to partition a list we’ll whip up our own real quick. After all, we’ve come this far.

```javascript
function chunk(list, chunkSize) {
  for (var position, i = 0, chunk = -1, chunks = []; i < list.length; i++) {
    if (position = i % chunkSize) {
      chunks[chunk][position] = list[i]
    } else {
      chunk++;
      chunks[chunk] = [list[i]]
    }
  }
  return chunks;
}
```

It loops through the list, splitting it up into equally sized chunks, returning them neatly wrapped in a list.
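For instance, chunking a five-element list with a chunk size of two yields two full chunks and one remainder:

```javascript
chunk([1, 2, 3, 4, 5], 2)
// => [[1, 2], [3, 4], [5]]
```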

Now we can chunk the initial list of messages, and boom, we have our own little MapReduce going, without magic, just code. Let’s put the new chunk function to good use.

```javascript
var mapResults = [];
// A chunk size of 1 splits our two messages into two chunks.
chunk(messages, 1).forEach(function(chunk) {
  var messages = chunk.map(function(message) {
    return message.host
  })
  mapResults = mapResults.concat(messages)
})
mapResults.sort().reduce(reduce, {})
```

We split up the messages into two chunks, run the map function for each chunk, collecting the results as we go. Then we sort the results and feed them into the reduce function. That’s MapReduce in eight lines of JavaScript code. Easy, right?
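With the two example messages from above, the final result of the reduce step comes out as:

```javascript
mapResults.sort().reduce(reduce, {})
// => { "lb1.basho.com": 1, "web2.basho.com": 1 }
```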

That’s all there is to MapReduce. You use it every day, whether you’re aware of it or not. It works nicely with simple data structures, and it’s just code.

Unfortunately, things get complicated as soon as you go distributed, for example in a Riak cluster. But we’ll save that for the next post, where we’ll examine why MapReduce is hard.

Mathias

Free Webinar – Riak with Node.js – March 15 @ 2PM Eastern

March 8, 2011

JavaScript is the lingua franca of the web, and many developers are starting to use node.js to power their server-side applications. Riak is a flexible, scalable database that has a JavaScript-friendly interface, including MapReduce in JavaScript and an awesome client library called riak-js. Put the two together and you have lots of possibilities!

We invite you to join us for a free webinar on Tuesday, March 15 at 2:00PM Eastern Time (UTC-4) to talk about Riak with node.js. In this webinar, we’ll discuss:

  • Getting riak-js, the Riak client for node.js, into your application
  • Storing, retrieving, manipulating key-value data
  • Issuing MapReduce queries
  • Finding data with Riak Search
  • Testing your code with the TestServer

We’ll address the above topics in addition to looking at a sample application. The presentation will last 30 to 45 minutes, with time for questions at the end. Fill in the form below if you want to get started building node.js applications on top of Riak!
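To give a flavor of what we’ll walk through, here’s a rough sketch using riak-js. The method names follow the library’s API as of early 2011 and may differ in other versions, and the bucket and data are made up:

```javascript
var db = require('riak-js').getClient({ host: 'localhost', port: 8098 });

// Store and retrieve key-value data.
db.save('airlines', 'KLM', { name: 'KLM', country: 'NL' }, function(err) {
  db.get('airlines', 'KLM', function(err, airline) {
    console.log(airline.name); // => "KLM"
  });
});

// Issue a MapReduce query over the bucket, using the built-in
// Riak.mapValuesJson function to decode each stored object.
db.add('airlines').map('Riak.mapValuesJson').run(function(err, results) {
  console.log(results);
});
```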

Intro to Riak Webcast

October 1, 2012

New to Riak? Join us this Thursday for an intro to Riak webcast with Mark Phillips, Basho director of community, and Shanley Kane, director of product management. In this 30 minute talk we’ll cover the basics of:

  • Good and bad use cases for Riak
  • Some user stories of note
  • Riak’s architecture: consistent hashing, hinted handoff, replication, gossip protocol and more
  • APIs and client libraries
  • Features for searching and aggregating data: Riak Search, Secondary Indexes and Map Reduce
  • What’s new in the latest version of Riak

Register for the webcast here.

Shanley

Mark

Intro to Riak Webcast this Thursday

**October 22, 2012**

If you missed our last ‘Intro to Riak’ webcast, not to fear: we’re doing another. This Thursday (11am PT / 2pm ET), join Shanley, Basho’s director of product marketing, and Mark, director of community, for a 30 minute webcast introducing Riak’s architecture, use cases, user stories, operations and data model.

Register for the webcast [here](http://info.basho.com/IntroToRiakOct25.html).

[Shanley](http://twitter.com/shanley)

[Mark](http://twitter.com/#!/pharkmillups)

Webcast: Intro to Riak, Thursday, January 17

**January 10, 2013**

Join us on Thursday, January 17 at 11 am PT / 2 pm ET for an intro to Riak webcast with Shanley Kane, director of product management, and Mark Phillips, director of community management.

In 30 minutes, we’ll cover:

* Good and bad use cases for Riak
* User stories of note, including social, content, advertising and session storage
* Riak’s architecture and mechanisms for remaining highly available in failure conditions
* APIs, data model and client libraries
* Features for querying and searching data
* What’s new in the latest version of Riak and what’s next

[Sign up for the webcast here](http://info.basho.com/IntroToRiakJan17.html).

[Basho](http://twitter.com/basho)

Why I Am Excited About Riak Search

October 20, 2010

Last week Basho released Riak 0.13, including (among many other great improvements) the first public release of the much-anticipated Riak Search. There are a number of reasons why I am very excited by this.

The first and most obvious reason why Riak Search is exciting is that it’s an excellent piece of software for solving a large class of data retrieval needs. Of course, this isn’t the first search system that is clustered and can grow by adding servers; that idea isn’t groundbreaking or very exciting on its own. However, it is the first such search system that I know of with the powerful and robust systems model that people have come to treasure in the Riak key/value store.

Being able to grow & shrink and handle failures in an easy and predictable way is much less common in search systems than in “simpler” data management systems. After we demonstrated that we could build something (anything!) with that kind of easy scalability and availability, our friends and customers began to ask if we could apply our ideas to some richer data models. Specifically, a common pattern began to emerge: people would deploy Riak and an indexing system such as Apache Solr side by side. This was a workable solution, but it could be operationally frustrating. The systems could get out of sync, capacity planning became more complicated, and, most importantly, the operations story around failure management became much more challenging. If only we could make the overall collection of systems as good at availability management as Riak was, then a new class of problems would be solved.

Those conversations began a journey of exploration and experimentation. This essential phase was led by John Muellerleile, one of the most creative and resourceful programmers I know. He did all of the early work, finding places where the state of the art in indexing and search could be merged with the “Riak Way” of doing things. More recently, amazing work was done by the entire Basho Engineering team to make Riak Search move from prototype to product.

Riak Search has only been out for about a week, but users are already discovering that they can use it in more than one way: it can function in full-text analysis, as an easy way to produce simple inverted indices over semi-structured data, and more.

That’s enough reason to be excited about Riak Search, but I have bigger reasons as well.

Riak Search is the first public demonstration that Riak Core is a meaningful base on which to build distributed systems beyond just a key/value store. By using the same central code base for distribution, dispatch, ownership, failure management, and node administration, we are able to confidently make many of the same guarantees for Search that we have made all along for Riak’s key/value storage. Even if Search itself weren’t such a compelling product, it would be exciting as proof of the value of Riak Core.

That value hints at Riak’s future — not as a single database but as a family of distributed systems for storing, managing, and retrieving data. We’ve now gone from one to two such systems, but we’re not stopping there. The work of creating Search was really two efforts: building Search itself and also the breakout & improvement of our Core. We can (and will) use that improved Core to build new systems in the future.

The subtlest, but perhaps most important, of the exciting things about Search is that it also uses Core to show how each new Riak system is greater than the sum of its parts. Riak Search is not just a search system using the same Core codebase as KV, it is running on the same actual nodes as KV. This allows us to develop features that don’t make sense in KV alone or in Search alone, but that take advantage of the shared running elements of Core. For instance, users can issue a search/map/reduce query that runs map/reduce style parallel processing with data locality on a dataset determined by a search result. As we develop further systems on Riak Core, we expect further such connections to make each one also benefit the entire Riak family in this way.
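As a concrete illustration of that connection, here is roughly what such a search-fed MapReduce job looks like as the JSON you’d POST to Riak’s /mapred endpoint. The `riak_search`/`mapred_search` input form is the documented hook for this; the bucket and query here are made up:

```javascript
// A MapReduce job whose inputs are the results of a Search query.
var job = {
  "inputs": {
    "module":   "riak_search",
    "function": "mapred_search",
    "arg":      ["tweets", "text:earthquake"]   // bucket, query (illustrative)
  },
  "query": [
    // The map phase runs with data locality on just the matching objects.
    { "map": { "language": "javascript", "name": "Riak.mapValuesJson" } }
  ]
};
```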

What we have released in the recent past is exciting. The future is even more exciting.

Justin