April 7, 2014
This month, it’s all about developer conferences – both local and international. If you’re in the area, come and say hi. We always love to chat Riak and can answer any questions you may have. If you aren’t going to be at any of these events, check out the Riak Mailing List for questions or Contact Us to help get started with Riak.
Here is a look at where we’ll be this month.
PyCon 2014: On the Wednesday (April 9th) of PyCon 2014, Basho Technical Evangelist, Tom Santero, will host a free workshop on “Building Applications on Riak,” starting at 1:30pm. PyCon takes place April 9-17 in Montreal, Canada.
ChefConf 2014: Basho will be attending ChefConf and our technical evangelists will be available to answer any Riak questions you may have. ChefConf 2014 takes place April 15-17 in San Francisco, CA.
CRAFT Conference: Basho is a proud sponsor of CRAFT Conference, which takes place April 23-25 in Budapest, Hungary. We will have a booth set up, so stop by to grab some great swag and chat Riak.
NoSQL Matters: At NoSQL Matters: Cologne, Basho Technical Evangelist, Joel Jacobson, will present on “CRDTs in Riak,” one of the new features available with Riak 2.0. His talk will take place at 12:30pm on April 30th. NoSQL Matters: Cologne takes place April 29-30 in Cologne, Germany.
For a full list of where we’ll be, both in April and beyond, visit our Events Page.
October 15, 2013
Ripple has not been maintained because we’ve learned that it’s not the right way to think about Riak. Using the riak-client APIs directly leads to better applications. We’re moving Ripple to basho-labs to avoid confusion.
The Ripple State of Mind
The Ripple document-relational mapper tool for Riak allows you to treat Riak objects like Ruby objects, very similarly to how ActiveRecord lets you treat Postgres rows like Ruby objects. This neglects the fundamental differences between Postgres and Riak, and encourages developers to use Riak badly.
SQL is a nice fit for Rails-like object usage because adding indexes isn’t prohibitively expensive, querying with indexes is cheap, and there’s a query planner that can use or mix indexes when available and can resort to a table scan when they’re not. Ripple, while it does have secondary index (2i) support, doesn’t have a planner to do set math on multiple indexes, so you either get to implement that yourself or write composite indexes. Adding an index after you have a dataset in production is hard too; it either only applies to new data or requires an expensive migration step, in which you load and re-save old records.
Ripple doesn’t provide any way to use some Riak 1.4 and planned 2.0 features, such as streaming 2i, multi-get, 2i return terms, and CRDTs. Ripple also doesn’t make it easy to make your frontend vector clock aware, which limits its usefulness in scenarios that create siblings.
These are complex features, and trying to wrap them in Ripple won’t necessarily make them easier to use.
My experience with complex Rails-style applications is that models eventually grow a bunch of class and instance methods to handle cases that are awkward for the ORM layer. An ActiveRecord model might have a SQL-using method as an optimization for a specific use case, or instance methods that perform a write and return something sensible in the case of a Postgres constraint violation.
Rails applications that want to use SQL without using ActiveRecord’s ORM can do so: just use connection.select_all and write some SQL. With Ripple, you can always drop down to riak-ruby-client and do work that way.
With that in mind, instead of the generic Riak 1.0 feature set in Ripple, we recommend wrapping Riak client methods in model objects. This exposes more complexity initially but, as your application grows and evolves, provides better opportunities to integrate new Riak features that help queryability, denormalize to reduce the number of Riak interactions that have to be done, automate certain data types, or provide consistency guarantees in appropriate situations.
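As a rough illustration of wrapping client calls in a model object, here is a minimal sketch. FakeKeyValueClient is a hypothetical in-memory stand-in so the example runs on its own; it is not the riak-client API, and the UserModel class, its fields, and the bucket name are likewise assumptions made for illustration.

```ruby
require 'json'

# Hypothetical in-memory stand-in for a key/value client.
# This is NOT the riak-client API; it only makes the sketch runnable.
class FakeKeyValueClient
  def initialize
    @objects = {}
  end

  def store(bucket, key, value)
    @objects[[bucket, key]] = value
  end

  def fetch(bucket, key)
    @objects[[bucket, key]]
  end
end

# A plain model object that owns its own serialization and storage calls,
# instead of delegating them to a generic mapper layer.
class UserModel
  BUCKET = 'users'.freeze

  attr_reader :username, :group

  def initialize(username, group)
    @username = username
    @group = group
  end

  # Serialize to JSON and store under the username as the key.
  def save(client)
    client.store(BUCKET, username, JSON.generate('username' => username, 'group' => group))
    self
  end

  # Fetch by key and rebuild the model; returns nil when the key is absent.
  def self.find(client, username)
    raw = client.fetch(BUCKET, username)
    return nil if raw.nil?

    data = JSON.parse(raw)
    new(data['username'], data['group'])
  end
end
```

Because the model owns the storage calls, a new capability (an extra index write, a denormalized copy, sibling resolution) can be added inside save or find without changing the callers.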
We’re moving Ripple to the basho-labs organization on GitHub to accurately reflect its status as unmaintained and deprecated.
September 4, 2013
For more background on the indexing techniques described, check out our blog post “Index for Fun and for Profit.”
The War Against Zombies is Still Raging!
In the United States, the CDC has recovered 1 million Acute Zombilepsy victims and has asked for our help loading the data into a Riak cluster for analysis and ground team support.
Know the Zombies, Know Thyself
The future of the world rests in a CSV file with the following fields:
- Full Name
- Zip Code
- National ID
- Feet Inches
For each record, we’ll serialize this CSV document into JSON and use the National ID as the Key. Our ground teams need the ability to find concentrations of recovered zombie victims using a map, so we’ll be using the Zip Code as an index value for quick lookup. Additionally, we want to enable a geospatial lookup for zombies, so we’ll also GeoHash the latitude and longitude, truncate the hash to four characters for approximate area lookup, and use that as an index term. We’ll use the G-Set Term-Based Inverted Indexes that we created, since the dataset will be used exclusively for read operations once it has been loaded. We’ve hosted this project on GitHub so that, in the event we’re overtaken by zombies, our work can continue.
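To make the GeoHash step concrete, here is a minimal sketch of the standard GeoHash encoding, written from the published algorithm rather than taken from the project’s code. The function name geohash_encode and the sample coordinates (an approximate point in downtown Washington DC) are assumptions for illustration.

```ruby
# Base32 alphabet used by GeoHash (it omits a, i, l and o).
GEOHASH_ALPHABET = '0123456789bcdefghjkmnpqrstuvwxyz'.freeze

# Encode a latitude/longitude pair as a GeoHash of `precision` characters.
# Bits alternate between longitude and latitude, starting with longitude;
# each bit halves the remaining range, and every 5 bits become one character.
def geohash_encode(latitude, longitude, precision = 4)
  lat_range = [-90.0, 90.0]
  lon_range = [-180.0, 180.0]
  hash = +''
  bits = 0
  bit_count = 0
  use_longitude = true

  while hash.length < precision
    range, value = use_longitude ? [lon_range, longitude] : [lat_range, latitude]
    mid = (range[0] + range[1]) / 2.0

    if value >= mid
      bits = (bits << 1) | 1
      range[0] = mid # keep the upper half
    else
      bits <<= 1
      range[1] = mid # keep the lower half
    end

    use_longitude = !use_longitude
    bit_count += 1

    next unless bit_count == 5

    hash << GEOHASH_ALPHABET[bits]
    bits = 0
    bit_count = 0
  end

  hash
end

# Approximate coordinates for downtown Washington DC (assumed for illustration).
puts geohash_encode(38.8977, -77.0365, 4)
```

Requesting only four characters is equivalent to truncating a longer hash, because each additional character only refines the cell described by the characters before it.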
In our load script, we read the text file and create new zombies, add Indexes, then store the record:
Our Zombie model contains the code for serialization and adding the indexes to the object:
Let’s run some quick tests against the Riak HTTP interface to verify that zombie data exists.
First let’s query for a known zombilepsy victim:
curl -v http://127.0.0.1:8098/buckets/zombies/keys/427-69-8179
Next, let’s query the inverted index that we created. If the index has not been merged, then a list of siblings will be displayed:
Zip Code for Jackson, MS:
curl -v -H "Accept: multipart/mixed" http://127.0.0.1:8098/buckets/zip_inv/keys/39201
GeoHash for Washington DC:
curl -v -H "Accept: multipart/mixed" http://127.0.0.1:8098/buckets/geohash_inv/keys/dqcj
Excellent. Now we just have to get this information in the hands of our field team. We’ve created a basic application which will allow our user to search by Zip Code or by clicking on the map. When the user clicks on the map, the server converts the latitude/longitude pair into a GeoHash and uses that to query the inverted index.
Colocation and Riak MDC will Zombie-Proof your application
First we’ll create a small Sinatra application with the two endpoints required to search by zip code and by latitude/longitude:
Our zombie model does the work to retrieve the indexes and build the result set:
Saving the world, one UI at a time
Searching for zombies in the Zip Code 39201 yields the following:
Clicking on Downtown New York confirms your fears and suspicions:
The geographic bounding inherent to GeoHashes is obvious in a point-dense area so, in this case, it would be best to query the adjacent GeoHashes.
Keep Fighting the Good Fight!
There is plenty left to do in our battle against zombies!
- Create a Zombie Sighting Report System so the concentration of live zombies in an area can quickly be determined based on the count and last report date.
- Add a crowdsourced Inanimate Zombie Reporting System so that members of the non-zombie population can report inanimate zombies. Incorporate Bayesian filtering to prevent false reporting by zombies. They kind of just mash on the keyboard so this shouldn’t be too difficult.
- Add a correlation feature, utilizing Graph CRDTs, so we can find our way back to Patient Zero.
August 28, 2013
What is an Index?
In Riak, the fastest way to access your data is by its key.
However, it’s often useful to be able to locate objects by some other value, such as a named collection of users. Let’s say that we have a user object stored under its username as the key (e.g., thevegan3000) and that this particular user is in the Administrators group. If you wanted to be able to find all users, such as thevegan3000, who are in the Administrators group, then you would add an index (let’s say, user_group) and set it to administrator for those users. Riak has a super-easy-to-use option called Secondary Indexes that allows you to do exactly this, and it’s available when you use either the LevelDB or Memory backends.
Using Secondary Indexes
Secondary Indexes are available in the Riak APIs and all of the official Riak clients. Note that user_group becomes user_group_bin when accessing the API because we’re storing a binary value (in most cases, a string).
Add and retrieve an index in the Ruby Client:
In the Python Client:
In the Java Client:
More Example Use Cases
Not only are indexes easy to use, they’re extremely useful:
- Reference all orders belonging to a customer
- Save the users who liked something or the things that a user liked
- Tag content in a Content Management System (CMS)
- Store a GeoHash of a specific length for fast geographic lookup/filtering without expensive Geospatial operations
- Time-series data where all observations collected within a time-frame are referenced in a particular index
What If I Can’t Use Secondary Indexes?
Indexing is great, but if you want to use the Bitcask backend or if Secondary Indexes aren’t performant enough, there are alternatives.
A G-Set Term-Based Inverted Index has the following benefits over a Secondary Index:
- Better read performance at the sacrifice of some write performance
- Less resource intensive for the Riak cluster
- Excellent resistance to cluster partition since CRDTs have defined sibling merge behavior
- Can be implemented on any Riak backend including Bitcask, Memory, and of course LevelDB
- Tunable via read and write parameters to improve performance
- Ideal when the exact index term is known
Implementation of a G-Set Term-Based Inverted Index
A G-Set CRDT (Grow Only Set Convergent/Commutative Replicated Data Type) is a thin abstraction on the Set data type (available in most language standard libraries). It has a defined method for merging conflicting values (i.e., Riak siblings), namely a union of the two underlying Sets. In Riak, the G-Set becomes the value that we store in our Riak cluster in a bucket, and it holds a collection of keys to the objects we’re indexing (such as thevegan3000). The key that references this G-Set is the term that we’re indexing, administrator. The bucket containing the serialized G-Sets accepts Riak siblings (potentially conflicting values), which are resolved when the index is read. Resolving the indexes involves merging the sibling G-Sets, which means that keys cannot be removed from this index, hence the name: “Grow Only”.
administrator G-Set Values prior to merging, represented by sibling values in Riak
administrator G-Set Value post merge, represented by a resolved value in Riak
Great! Show me the code!
As a demonstration, we integrated this logic into a branch of the Riak Ruby Client. As mentioned before, since a G-Set is actually a very simple construct and Riak siblings are perfect to support the convergent properties of CRDTs, the implementation of a G-Set Term-Based Inverted Index is nearly trivial.
There’s a basic interface that belongs to a Grow Only Set in addition to some basic JSON serialization facilities (not shown):
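A minimal sketch of what such a Grow Only Set interface might look like follows. It is not the code from the client branch; the class name GrowOnlySet, its method names, and the JSON layout are assumptions for illustration.

```ruby
require 'set'
require 'json'

# A grow-only set: members can be added but never removed,
# and two copies are merged by taking the union of their members.
class GrowOnlySet
  attr_reader :members

  def initialize(members = [])
    @members = Set.new(members)
  end

  def add(member)
    @members.add(member)
    self
  end

  def include?(member)
    @members.include?(member)
  end

  # Union is commutative, associative and idempotent, so siblings can be
  # merged in any order, any number of times, with the same result.
  def merge(other)
    GrowOnlySet.new(@members | other.members)
  end

  # Assumed JSON layout: {"members": [...]}; members are sorted for
  # stable output, so they must be mutually comparable (e.g. strings).
  def to_json(*_args)
    JSON.generate('members' => @members.to_a.sort)
  end

  def self.from_json(json)
    new(JSON.parse(json).fetch('members'))
  end
end
```

There is deliberately no remove method: a union can never drop a member, so a deletion on one replica would reappear after the next merge.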
Next there’s the actual implementation of the Inverted Index. The index put operation simply creates a serialized G-Set with the single index value into Riak, likely creating a sibling in the process.
The index get operation retrieves the index value. If there are siblings, it resolves them by merging the underlying G-Sets, as described above, and writes the resolved record back into Riak.
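The put and get behavior described above can be sketched as follows. This is a simplified illustration, not the client branch’s implementation: SiblingStore is a hypothetical in-memory stand-in for a Riak bucket that keeps siblings, each G-Set is represented by a plain Ruby Set, and all class and method names are assumptions.

```ruby
require 'set'

# Hypothetical in-memory stand-in for a Riak bucket that keeps siblings:
# every uncoordinated write is retained as another value under the key.
class SiblingStore
  def initialize
    @siblings = Hash.new { |hash, key| hash[key] = [] }
  end

  def put_sibling(key, value)
    @siblings[key] << value
  end

  # Returns the current sibling values, or an empty list for unknown keys.
  def siblings(key)
    @siblings.fetch(key, [])
  end

  # Replaces all siblings with a single resolved value.
  def replace(key, value)
    @siblings[key] = [value]
  end
end

class InvertedIndex
  def initialize(store)
    @store = store
  end

  # Index put: write a one-element G-Set under the term,
  # likely creating a sibling in the process.
  def put(term, object_key)
    @store.put_sibling(term, Set[object_key])
  end

  # Index get: union all sibling G-Sets and, if there was more than one,
  # write the resolved set back so later reads see a single value.
  def get(term)
    siblings = @store.siblings(term)
    return Set.new if siblings.empty?

    resolved = siblings.reduce(Set.new) { |merged, sibling| merged | sibling }
    @store.replace(term, resolved) if siblings.length > 1
    resolved
  end
end

store = SiblingStore.new
index = InvertedIndex.new(store)
index.put('administrator', 'thevegan3000')
index.put('administrator', 'alice')
puts index.get('administrator').to_a.sort.inspect
```

Writes stay cheap because put never reads the existing index; the cost of reconciliation is deferred to the first get that encounters siblings.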
With the modified Ruby client, adding a Term-Based Inverted Index is just as easy as adding a Secondary Index. Instead of using _bin to indicate a string index, we’ll use _inv for our Term-Based Inverted Index.
Binary Secondary Index:
zombie.indexes['zip_bin'] << data['ZipCode']
Term-Based Inverted Index:
zombie.indexes['zip_inv'] << data['ZipCode']
The downsides of G-Set Term-Based Inverted Indexes versus Secondary Indexes
- There is no way to remove keys from an index
- Storing a key/value pair with a Riak Secondary Index takes about half the time of putting an object with a G-Set Term-Based Inverted Index, because the G-Set index involves an additional Riak put operation for each index being added
- The Riak object which the index refers to has no knowledge of which indexes have been applied to it
- It is possible, however, to update the metadata for the Riak object when adding its key to the G-Set
- There is no option for searching on a range of values (e.g., all
See the Secondary Index documentation for more details.
The downsides of G-Set Term-Based Inverted Indexes versus Riak Search
Riak Search is an alternative mechanism for searching for content when you don’t know which keys you want.
- No advanced searching: wildcards, boolean queries, range queries, grouping, etc
See the Riak Search documentation for more details.
Let’s see some graphs.
The graph below shows the average time to put an object with a single index and to retrieve a random index from the body of indexes that have already been written. The times include the client-side merging of index object siblings. It’s clear that although the put times for an object plus a G-Set Term-Based Inverted Index are roughly double those of an object with a Secondary Index, the index retrieval times are less than half. This suggests that Secondary Indexes would be better for write-heavy loads, but G-Set Term-Based Inverted Indexes are much better where reads outnumber writes.
Over the length of the test, it is even clearer that G-Set Term-Based Inverted Indexes offer higher performance than Secondary Indexes when the workload of Riak skews toward reads. The use of G-Set Term-Based Inverted Indexes is very compelling even when you consider that the index merging is happening on the client-side and could be moved to the server for greater performance.
- Implement other CRDT Sets that support deletion
- Implement G-Set Term-Based Indexes as a Riak Core application so merges can run alongside the Riak cluster
- Implement strategies for handling large indexes such as term partitioning