January 22, 2015
In speaking with Riak users, both open source and commercial, we are frequently told that Riak’s key/value model is more flexible and faster to develop against than a traditional relational database. Even though Riak is well suited for many applications, there are inevitable tradeoffs in terms of query options and data types that are available. With a key/value model, there is no concept of columns or rows, therefore Riak does not have join operations. Riak can be queried either directly via HTTP, the protocol buffers API and through various client libraries. However, there is no SQL or SQL-like language that is currently available.
Riak’s key/value data model does not preclude queryability. There are several powerful querying options including:
- Riak Search: Integration with Apache Solr provides full-text search and support for Solr’s client query APIs.
- Secondary Indexes: Secondary Indexes (2i) give developers the ability to tag an object stored in Riak with one or more query values. Indexes can be either integers or strings, and can be queried by either exact matches or ranges of values.
- MapReduce: Developers can leverage Riak MapReduce for tasks like filtering documents by tag, counting words in documents, and extracting links to related data.
For more information, check out the Riak documentation on Querying Data.
The table below illustrates key/value mappings for common application types. Remember that values in Riak are opaque and stored on disk as binaries – JSON or XML documents, images, text, etc. The way data is organized in Riak should take into account the unique needs of the application, including access patterns such as read/write distribution, latency differences between various operations, use of Riak features (including MapReduce, Search, Secondary Indexes), and more.
|Session||User/Session ID||Session Data|
|Advertising||Campaign ID||Ad Content|
|Sensor||Date, Date/Time||Sensor Updates|
|User Data||Login, eMail, UUID||User Attributes|
|Content||Title, Integer||Text, JSON/XML/HTML Document, Images, etc.|
Consider, for example, one of the canonical use cases for Riak…storing user and session data. In a relational database, the “users” table is well known and, basically, provides a unique identifier per user, and then a series of identifying information about that user as individual columns such as:
- First name
- Last name
- Counter of Site Visits
- Paid Account Identifier
This data can then be used to correlate or count, paid users, common interests, etc. via a series of SQL queries against the row/column structure of the users table.
Riak, in contrast, provides flexibility in how this data can be modeled based upon the application use case. It may be desirable to create a Users bucket, with the UserName (or Unique Identifier) as the key and a JSON object storing all user attributes as the value. Or, as we describe in Data Modeling with Riak Data Types, leverage the power of Riak Data Types by creating a map type for each user storing:
- first and last name strings in the register type,
- interests as a set,
- a counter for visits,
- and a flag for paid account identifier.
One of the best ways to enable application interaction with objects (a key/value pair) in Riak is to provide structured bucket and key names for the objects. This approach often involves wrapping information about the object in the object’s location data itself.
For example, appending a timestamp, UUID, or Geographical coordinate, to a key’s name allows for fine grained queryability via simple lookup to locate and retrieve a specific set of information. Leveraging the same naming mechanism as created for users (UniqueID as the key) enables, in a separate sessions bucket, storing the UUID append with a timestamp as the key and the session data (in binary format) as the object. In this way, using the same UUID, I am able to obtain both user and session data stored in different buckets and in different formats.
For additional information, and more complex considerations such as modeling relationship and advanced social applications, see the Riak documentation on use cases and data modeling.
Resolving Data Conflicts
In any system that replicates data, conflicts can arise – e.g., if two clients update the same object at the exact same time or if not all updates have yet reached hardware that is experiencing lag. Riak is “eventually consistent” – while data is always available, not all replicas may have the most recent update at the same time, causing brief periods (generally on the order of milliseconds) of inconsistency while all state changes are synchronized.
However, Riak does provide features to detect and help resolve the statistically small number of incidents when data conflicts occur. When a read request is performed, Riak looks up all replicas for that object. By default, Riak will return the most updated version, determined by looking at the object’s vector clock. Vector clocks are metadata attached to each replica when it is created. They are extended each time a replica is updated to keep track of versions. Clients can also be allowed to resolve conflicts themselves.
Further, when an outdated object is discovered as part of a read request, Riak will automatically update the out-of-sync replica to make it consistent. Read Repair, a self-healing property of the database, will even update a replica that returns a “not_found” in the event that a node loses it due to physical failure.
Riak also features “Active Anti-Entropy,” which is an automatic self-healing property that runs in the background. Rather than waiting for a read request to trigger a replica repair (as with Read Repair), Active Anti-Entropy constantly uses a hash tree exchange to compare replicas of objects and automatically repairs or updates any that are divergent, missing, or corrupt. This can be beneficial for large clusters storing “stale” data.
More information on vector clocks, dotted version vectors, and conflict resolution can be found in the online documentation in the section regarding Causal Context.
Multi-site replication is quickly becoming critical for many of today’s platforms and applications. Not only does replication across multiple clusters provide geographic data locality – the ability to serve global traffic at low-latencies – it can also be an integral part of a disaster recovery or backup strategy. Other teams may use multi-site replication to maintain secondary data stores, both for failover as well as for performing intensive computation without disrupting production load. Multi-site replication is included in Basho’s commercial extension to Riak, Riak Enterprise, which also includes 24/7 support.
Multi-site replication in Riak works differently than the typical approach seen in the relational world, multi-master replication. In Riak’s multi-datacenter replication, one cluster acts as a “primary cluster.” The primary cluster handles replication request from one or more “secondary clusters” (generally located in datacenters in other regions or countries). If the datacenter with the primary cluster goes down, a secondary cluster can take over as the primary cluster. In this sense, Riak’s multi-datacenter capabilities are “masterless.”
In multi-datacenter replication, there are two primary modes of operation: full sync and real-time. In full sync mode, a complete synchronization occurs between primary and secondary cluster(s). In real-time mode, continual, incremental synchronization occurs – replication is triggered by new updates. Full sync is performed upon initial connection of a secondary cluster, and then periodically (by default, every 6 hours). Full sync is also triggered if the TCP connection between primary and secondary clusters is severed and then recovered.
Data transfer is unidirectional (primary->secondary). However, bidirectional synchronization can be achieved by configuring a pair of connections between clusters.
Full documentation for multi-datacenter replication in Riak Enterprise is available in the online documentation.
Modeling data in any non-relational solution requires a different way of thinking about the data itself. Rather than an assumption that all data cleanly fits into a structure of rows and columns, the data domain can be overlayed on the core Key/Value store (Riak) in a variety of ways. There are, however, distinct tradeoffs and benefits to understand.
Relational Databases have:
- Foreign keys and constraints
- Sophisticated query planners
- Declarative query language (SQL)
- A Key/Value model where the value is any unstructured data
- More data redundancy that provides better availability
- Eventual consistency
- Simplified query capabilities
- Riak Search
What you will gain:
- More flexible, fluid designs
- More natural data representations
- Scaling without pain
- Reduced operational complexity
For more information on Data Modeling, or to chat with a member of the Basho team on the topic, please request a Tech Talk.
In a previous post we briefly introduced Riak 2.0 data types. The addition of these distributed Data Types simplifies application development by automatically handling sibling resolution. This means developers can spend less time thinking about the complexities of vector clocks and sibling resolution and, instead, let Data Types support their applications’ data access patterns.
Understanding these data types requires a brief trip through history…
Riak 1.4 Counters
Riak 1.4 introduced counters as the first data types. Prior to 1.4 we’ve always said: “Your data is opaque to Riak,” — and it still can be — but with the addition of counters that is not longer the case. Riak knows what is stored in a counter key, and how to increment and decrement it through the counter API. It isn’t necessary to fetch, mutate, or put a counter. Instead you just incremented by 5 or decremented by 100. Vector Clocks, as discussed in the post entitled Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems, as Riak knew how to merge concurrent writes there was never a sibling created.
Counters are very valuable, but you can not build many applications on just counters. Now, in Riak 2.0, we’ve added more data types. We believe that, with the addition of these data types you can model many applications’ data storage needs with greater simplicity, and never have to write sibling merge functions again.
What are CRDTs?
You may have heard a Basho presentation, or blog post, reference “CRDTs”. CRDT stands for (variously) Conflict-free Replicated Data Type, Convergent Replicated Data Type, Commutative Replicated Data Type, and others. The key, repeated, phrase is “Replicated Data Types”.
Replication is inherent in Riak. It is what the n-value defines. It is part of what lends to the availability and fault tolerance characteristics that Riak provides. Data Types are a common construct in computing. Sets, Bags, Lists, Registers, Maps, Counters…etc.
That leaves us to consider the “C”.
Conflict Free, or “Opaque No More”
Riak is an eventually consistent system. It leans, very much, towards the AP end of the CAP spectrum. (For more reading on the topic, the Practical Tradeoffs section of A Little Riak Book is particularly illuminating). This availability is achieved with mechanisms like sloppy quorum writes to fallback nodes. However, even without partitions and many nodes, interleaved or concurrent writes can lead to conflicts. Traditionally, Riak keeps all values and presents them to the user to resolve. The client application must have a deterministic way to resolve conflicts. It might be to pick the highest timestamp, or union all the values in a list, or something more complex. Whatever approach is chosen, it is ad-hoc, and created specifically for the data model and application at hand.
With Riak data types, there is still “conflict”. However, the resolution for that conflict is inherent and part of the data type’s design. The data types for Riak 2.0 converge automatically, at write and read time, on the server. If a client application can model its data using the data types provided, no sibling values will be seen and there is no longer a need to write ad-hoc, custom merge functions.
When modeling an applications data domain in a programming language, developers are familiar with composing state from a few primitive data types. Riak Data Types give the developer that power back and expressivity, and relieve them of the burden of design and testing deterministic merge functions. The key is that the data is no longer opaque to Riak. When the Data Types API is leveraged, Riak “knows” what type of thing is being stored and is able to perform the merge automatically.
When reading a Data Type from Riak, you will only ever see a single value. That value is still eventually consistent, but it will be as correct as it can be given the amount of entropy in the database. When the system is stable, all values will converge on a single, deterministic, correct value.
What Data Types Are Available?
Riak 2.0 includes the following Data Types:
- Counters: as in Riak 1.4
- Flags: enabled/disabled
- Sets: collections of binary values
- Registers: named Binary values with values also binary
- Maps: a collection of fields that supports the nesting of multiple Data Types
The conflict resolution, as discussed above, is intrinsic to the Data Type itself. This table provides greater detail.
|Data Type||Use Cases||Conflict Resolution Rule|
||Each actor keeps and independent count for increments and decrements. Upon merge, the pairwise maximum of any two actors will win (e.g. if one actor holds 172 and other holds 173, 173 will win upon merge)|
||Enable wins over disable|
||If an element is concurrent added and removed the add will win|
||The most chronologically recent value wins, based on timestamps|
||If a field is concurrently added, or updated and removed, the addd / update will win|
A new version of Riak, with new Data Types, allowing you to model your application in more expansive ways. Take these Data Types for a spin and be sure to let us know how you use them in your applications.
December 11, 2013
In the world of distributed systems, there are still a lot of unsolved problems and improvements to be made. This means that there is a lot of interesting research being done at top institutions around the world – with some of the brightest minds looking to improve distributed systems. At RICON West, Basho’s developer conference, we brought three PhD students and candidates to speak, whose work on distributed systems has been vital to both Basho and the future of the industry.
Peter Bailis is a PhD student at UC Berkeley. His talk, “Bad As I Wanna Be: Coordination and Consistency in Distributed Databases,” goes into how to reason about the trade-offs between coordination, consistency, latency, and availability, with a focus on practical takeaways from recent research both at UC Berkeley and beyond. He also talks about reconciling “consistency” in NoSQL and ACID databases and explains why, even though you probably didn’t “beat the CAP Theorem,” you (and tomorrow’s database designs) may be on to something. His full talk is below.
Lindsey Kuper is a PhD candidate at Indiana University, who studies the foundations of deterministic parallel programming. At RICON, she spoke on “LVars: Lattice-Based Data Structures for Deterministic Parallelism,” which introduces LVars (data structures that enable deterministic parallel programming). LVars generalize the single-assignment variables often found in deterministic parallel languages to allow multiple assignments that are monotonically increasing with respect to a user-specified lattice of states. LVars maintain determinism by allowing only monotonic writes and “threshold” reads to and from shared data. Her talk looks at examples of programming in an LVar-based parallel language that is provably deterministic, and explores the connection between LVars and CRDTs. The complete talk is below.
Finally, we had Diego Ongaro, a PhD student at Stanford University, talk about “The Raft Consensus Algorithm.” His talk discusses Raft, a consensus algorithm designed for understandability and developed by Diego and Professor John Ousterhout. Raft is equivalent to Paxos in fault-tolerance and performance, but it’s designed to be as easy to understand as possible, while cleanly addressing all major pieces needed for practical systems. The hope is that Raft will make consensus available to a wider audience, and that this wider audience will be able to develop a wider variety of higher quality consensus-based systems than are available today. You can learn more about Raft below.
To watch all of the sessions from RICON West 2013, visit the Basho Technologies Youtube Channel.
November 12, 2013
The first email:
Occasionally, riak seems to not store an object I try to save. I have run tcpdump on the node receiving the request to ensure it is receiving the http packets with the correct JSON from the client. When the issue occurs the node is in fact receiving the request with the correct JSON.
Riak is designed to accommodate server and network failures without ever losing committed writes, so this led to a quick response from Basho’s engineers.
After some discussion, a vital piece of information was revealed:
One other thing that might be worth mentioning here is the writes I’m mentioning are actually updates to existing objects. The object exists, an attempt to write an update for the object appears to be received by a node, but the object maintains it’s original value.
Riak was dropping updates rather than writes, which is a horse of a different color. To see why updates are much more problematic for any distributed database, read on.
In a database that runs on a single server, setting aside any complications introduced by transactions or locks, the second of two updates to the same record will overwrite the first. Last write wins.
With Riak’s simplest conflict resolution behavior, the second of two updates to the same object may or may not overwrite the first, even if those two updates are spaced far apart. Last write wins, except when it doesn’t, but even then it does.
The problem is simple: there is no reliable definition of “last write”; because system clocks across multiple servers are going to drift.
On a single server, there’s one canonical clock, regardless of accuracy. The system can always tell which write occurred in which order (assuming that the clock is always increasing; setting a clock backwards can cause all sorts of bad behavior).
So, back to our original problem with lost updates:
The nodes were a bit out of synch (up to 30 seconds… looking into why ntp wasn’t working!). So far it appears this was the issue.
If two updates to the same object occur within 30 seconds in such an environment, the end result is unpredictable.
Taming the Beast
The conclusion drawn from the discussion was to implement (and, hopefully, to monitor) time synchronization. This is a step in the right direction, and one that every distributed system should implement, but there are more powerful and instructive lessons to impart.
Some of this discussion requires awareness of siblings, vector clocks, and related arcana. If you wish to read more about these topics, Basho’s earlier blog post Understanding Riak’s Configurable Behaviors: Part 1 provides sufficient context. (You can find links in the epilogue to the full series, but part 1 covers the necessary background for this post.)
If instead you decide you’d like to avoid reading about and dealing with such complexities entirely, skip over the Nitty Gritty section to The Land of Milk and Honey.
One approach that should generally be employed when writing Riak applications is to supply vector clocks with each update. It’s not clear in this particular scenario that it would have helped, but it certainly can’t hurt. Giving Riak more information to track causal history is never a bad thing.
Forcing the Last Write to Win
A rather non-obvious approach is to take the default last write wins conflict resolution one step further.
As discussed in part 1 of the configurable behaviors blog series, there are two closely-related configuration parameters that determine how Riak approaches conflict resolution:
last_write_wins. The former indicates whether Riak should keep all conflicts for the client to resolve; the latter is our concern at the moment.
allow_mult is set to
true will instruct Riak to always overwrite existing objects, ignoring the timestamps stored with them.
So, nominally, this achieves what we earlier implied to be impossible: the last write truly does win, regardless of clock consistency.
The problem is that we’ve just punted the problem down the road a bit. Yes, all servers that receive an object will blindly write it, but any servers that don’t receive it due to network partition or server failure will still retain an older value, and depending on clock consistency the older value may still win once the network or server failure is corrected.
Broadly speaking, if you’re going to have data consistency problems, it’s best for that to be obvious and easily detectable during testing stages. This “solution'; would have made the situation much harder to recognize before production.
Stopping Last Write Wins
At least in part to limit the complexity of developing applications, Basho decided to specify Riak’s default configuration as
allow_mult=false, which requires the database to resolve conflicting writes internally.
As we’ve seen, Riak isn’t exactly a genius at resolving conflicting writes. Beyond the challenges of clock consistency, Riak treats objects as opaque and has no awareness of business logic.
It’s almost always better to bite the bullet: instruct Riak to retain all conflicting updates as siblings (via
allow_mult=true) and write your application to deal with them appropriately.
Note: We are planning to change the default setting for
true in Riak 2.0, but please check the documentation and your configuration before assuming either behavior.
The Land of Milk and Honey
Distributed data types
Creating data types that can survive network partitions and self-heal has long been a goal for our engineers. With Riak 1.4, Basho introduced distributed counters; with 2.0, Riak will have a larger suite of distributed data types that can resolve conflicts internally, notably including sets and maps.
Although 2.0 is not yet released, a technical preview is available.
It is also possible to define such Riak Data Types (known formally as CRDTs) at the application layer. See the two-part blog series Index for Fun and for Profit and Indexing the Zombie Apocalypse With Riak for more information.
Also with 2.0, Riak will include the option of designating certain data as strongly consistent, meaning that the servers that hold a piece of data will have to agree on any updates to that data.
As appealing as that may sound, it is impossible to guarantee strong consistency without introducing coordination overhead and constraining Riak’s ability to continue to allow for requests when servers or networks have failed.
And aren’t low latency and high availability the reasons you’re using Riak?
The Silver(*) Bullet: Immutability
(* or at least stainless steel)
The rise of “big data” is linked to a resurgence of interest in functional programming, which is particularly well-suited for processing large data sets. (See Dean Wampler’s Lambda Jam talk Copious Data for an interesting exposition of this idea.)
One of the key tenets of functional programming is that data is immutable, meaning that destructive updates are not (typically) allowed.
The relational data model does not offer much (any?) support for immutable data, but it is a powerful concept. At Basho’s inaugural RICON conference Pat Helland gave a talk entitled Immutability Changes Everything which goes into more detail.
While it isn’t necessarily true that immutability solves everything with distributed systems, it’s a great start. Without data updates, there are no conflicts.
See the configurable behaviors epilogue (specifically, the discussion of Datomic) for a discussion of configuration tweaks to Riak to take better advantage of immutable data for low latency.
If your distributed system isn’t explicitly dealing with data conflicts, any correct behavior it exhibits is more a matter of good luck than of good design.
If your distributed database relies on clocks to pick a winner, you’d better have rock-solid time synchronization, and even then, it’s unlikely your business needs are served well by blindly selecting the last write that happens to arrive.
Riak provides powerful tools for helping address the inherent challenges of distributed data, but they have to be used to be useful.
October 29, 2013
Throughout RICON West, we will be discussing many of the Riak 2.0 features (both in track sessions or during lightning talks), so keep your eyes on the live stream over the next two days. Videos of all sessions will also be made available after the conference.
Here is a look at some of the major enhancements available in Riak 2.0:
- Riak Data Types. Building on the eventually consistent counters introduced in Riak 1.4, Riak 2.0 adds sets and maps as new distributed data types. These Riak Data Types simplify application development without sacrificing Riak’s availability and partition tolerance characteristics.
- Strong Consistency. Developers have the flexibility to choose whether buckets should be eventually consistent (the default Riak configuration today that provides high availability) or strongly consistent, based on data requirements.
- Full-Text Search Integration with Apache Solr. Riak Search is completely redesigned in Riak 2.0, leveraging the Apache Solr engine. Riak Search in 2.0 supports the Solr client query APIs, enabling integration with a wide range of existing software and commercial solutions.
- Security. Riak 2.0 adds the ability to administer access rights and utilize plug-in authentication models. Authentication and Authorization is provided via client APIs.
- Simplified Configuration Management. Riak 2.0 continues to improve Riak’s operational simplicity by changing how, and where, configuration information is stored in an easy-to-parse and transparent format.
- Reduced Replicas for Multiple Data Centers. Riak Enterprise 2.0 can optionally store fewer copies of replicated data across multiple data centers to better maintain a balance between storage overhead and availability.
Ready to get started? Download the Technical Preview.
Please note that this is only a Technical Preview of Riak 2.0. This means that it has been tested extensively, as we do with all of our release candidates, but there is still work to be completed to ensure it’s production hardened. Between now and the final release, we will be continuing manual and automated testing, creating detailed use cases, gathering performance statistics, and updating the documentation for both usage and deployment.
Riak 2.0 Technical Preview: Deep Dive
Riak Data Types
In distributed systems, we are forced to trade consistency for availability (see: CAP Theorem) and this can complicate some aspects of application design. In Riak 2.0, we have integrated cutting-edge research on data types known as called CRDTs (Conflict-Free Replicated Data Types) pioneered by INRIA to create Riak Data Types. By adding counters, sets, maps, registers, and flags, these Riak Data Types enable developers to spend less time thinking about the complexities of vector clocks and sibling resolution and, instead, focusing on using familiar, distributed data types to support their applications’ data access patterns.
A more detailed overview of Riak Data Types is available that examines implementation considerations and the basics of usage.
In all prior versions, Riak was classified as an eventually consistent system. With the 2.0 release, Riak now lets developers choose when operations should be strongly or eventually consistent. This gives developers a choice between these semantics for different types of data. At the same time, operators can continue to enjoy the operational simplicity of Riak. Consistency preferences are defined on a per bucket type basis, in the same cluster.
A RICON West 2012 talk entitled, Bringing Consistency to Riak, shares much of the initial thinking behind this effort. In addition, the pull request that adds consistency to
riak_kv provides detailed information about related repositories and the implementation approach.
Redesigned Full-Text Search
Riak is a key/value store and the values are simply stored on disk as binary. With previous versions of Riak Search, Riak developers have long been able to index the content of these stored values. In Riak 2.0, Riak Search (code-named Yokozuna) has been completely redesigned and now uses the Apache Solr full-text document indexing engine directly. Together, Riak and Solr provide a reliable full-text context indexing solution that is highly available and built for scale. In addition, Riak Search 2.0 also fully supports the Solr client query APIs, which enables integration with existing software solutions (either homegrown or commercial).
The Basho engineers responsible for Yokozuna have created a resources page that includes recorded talks, Solr documentation links, and books on the topic.
Basho designed Riak with critical data in mind. Whether it’s data that affects revenue, user experience, or even a patient’s health (as is the case with the NHS), Riak ensures that this critical data is always available. However, often this critical data is also sensitive data. Riak 2.0 adds security to this data through the ability to administer access rights and plug-in various secure authentication models commonly used today.
The initial RFC that describes the security effort, including related Pull Requests, is available at github.com/basho/riak/issues/355.
Simplified Configuration Management
At Basho, we pride ourselves on providing operationally friendly software that functions smoothly when dealing with the challenges of a distributed system. In the past, configuration of Riak occurred in two files:
vm.args. Riak 2.0 changes how and where configuration information is stored. It no longer uses Erlang-specific syntax but, rather, provides a layout more suited for all operators and automated deployment tools. This layout is easy to parse and transparent for Riak administrators.
More information on the vision and specific implementation considerations are contained in the repository at github.com/basho/cuttlefish.
In versions of Riak prior to 2.0, keys were made up of two parts: the bucket they belong to and a unique identifier within that bucket. Buckets act as a namespace and allow for similar keys to be grouped. In addition, they provide a means of configuring how previous versions of Riak treated that data.
In Riak 2.0, several new features (security and strong consistency in particular) need to interact with groups of buckets. To this end, Riak 2.0 includes the concept of a Bucket Type. In addition to allowing new features without special prefixes in Bucket names, Riak developers and operators are able to define a group of buckets that share the same properties and only store information about each Bucket Type, rather than individual buckets.
More information about Bucket Types can be found in the Github Issue at github.com/basho/riak/issues/362. This issue describes the planned functionality, discussions about implementation, and includes related pull requests.
Change in Defaults for Sibling Resolution
Riak has always supported both application-side and timestamp and vector clock-based Last Write Wins server-side resolution. Prior to Riak 2.0, vector clock-based Last Write Wins has been the default. Moving forward, new clusters will hand off siblings to applications by default. This is the safest way to work with Riak, but requires developers to be aware of sibling resolution.
More Efficient Use of Physical Memory
Riak nodes are designed to manage the changing demands of a cluster as it experiences network, hardware, and other failures. To do this, Riak balances each node’s resources accordingly. Riak 2.0 has vastly improved LevelDB’s use of available physical memory (RAM) by allowing local databases to dynamically change their cache sizes as the cluster fluctuates under load.
In the past, it was necessary to specify RAM allocation for different LevelDB caches independently. This is no longer the case. In Riak 2.0, LevelDB databases that manage key/value or active anti-entropy data share a single pool of memory, and administrators are free to allocate as much of the available RAM to LevelDB as they feel is appropriate in their deployment. Detailed implementation documentation can be found in the basho/leveldb wiki.
Riak Ruby Vagrant Project
If you are interested in testing Riak 2.0, in a contained environment with the Riak Ruby Client, Basho engineer Bryce Kerley has put together the Riak-Ruby-Vagrant repository. In addition, this environment can be easily adapted to usage with other clients for testing the new features of Riak 2.0.
September 11, 2013
The second episode of Think Distributed, a podcast focused on distributed systems, is now available. This episode discusses causality, vector clocks, version vectors, and CRDTs.
This episode’s panelists are:
- Chris Meiklejohn, Software Engineer at Basho Technologies
- Peter Bailis, Graduate Student at UC Berkeley
- Carlos Baquero, HASLab, INESC Tec and Universidade do Minho
- Marek Zawirski, Graduate Student at UPMC-LIP6 / INRIA Paris
In addition to participating on this panel, Peter Bailis will be speaking at RICON West, Basho’s distributed systems conference. His talk, “Bad As I Wanna Be: Coordination and Consistency in Distributed Databases” will discuss how to reason about the trade-offs between coordination, consistency, latency, and availability, with a focus on practical takeaways from recent research both at Berkeley and beyond. RICON West will take place in San Francisco from October 29-30th. Tickets are still available here: ricon-west-2013.eventbrite.com/
August 28, 2013
What is an Index?
In Riak, the fastest way to access your data is by its key.
However, it’s often useful to be able to locate objects by some other value, such as a named collection of users. Let’s say that we have a user object stored under its username as the key (e.g.,
thevegan3000) and that this particular user is in the
Administrators group. If you wanted to be able to find all users, such as
thevegan3000 who are in the Administrators group, then you would add an index (let’s say,
user_group) and set it to
administrator for those users. Riak has a super-easy-to-use option called Secondary Indexes that allows you to do exactly this and it’s available when you use either the LevelDB or Memory backends.
Using Secondary Indexes
Secondary Indexes are available in the Riak APIs and all of the official Riak clients. Note that
user_group_bin when accessing the API because we’re storing a binary value (in most cases, a string).
Add and retrieve an index in the Ruby Client:
In the Python Client:
In the Java Client:
More Example Use Cases
Not only are indexes easy to use, they’re extremely useful:
- Reference all orders belonging to a customer
- Save the users who liked something or the things that a user liked
- Tag content in a Content Management System (CMS)
- Store a GeoHash of a specific length for fast geographic lookup/filtering without expensive Geospatial operations
- Time-series data where all observations collected within a time-frame are referenced in a particular index
What If I Can’t Use Secondary Indexes?
Indexing is great, but if you want to use the Bitcask backend or if Secondary Indexes aren’t performant enough, there are alternatives.
A G-Set Term-Based Inverted Index has the following benefits over a Secondary Index:
- Better read performance at the sacrifice of some write performance
- Less resource intensive for the Riak cluster
- Excellent resistance to cluster partition since CRDTs have defined sibling merge behavior
- Can be implemented on any Riak backend including Bitcask, Memory, and of course LevelDB
- Tunable via read and write parameters to improve performance
- Ideal when the exact index term is known
Implementation of a G-Set Term-Based Inverted Index
A G-Set CRDT (Grow Only Set Convergent/Commutative Replicated Data Type) is a thin abstraction on the Set data type (available in most language standard libraries). It has a defined method for merging conflicting values (i.e. Riak siblings), namely a union of the two underlying Sets. In Riak, the G-Set becomes the value that we store in our Riak cluster in a bucket, and it holds a collection of keys to the objects we’re indexing (such as
thevegan3000). The key that references this G-Set is the term that we’re indexing,
administrator. The bucket containing the serialized G-Sets accepts Riak siblings (potentially conflicting values) which are resolved when the index is read. Resolving the indexes involves merging the sibling G-Sets which means that keys cannot be removed from this index, hence the name: “Grow Only”.
administrator G-Set Values prior to merging, represented by sibling values in Riak
administrator G-Set Value post merge, represented by a resolved value in Riak
Great! Show me the code!
As a demonstration, we integrated this logic into a branch of the Riak Ruby Client. As mentioned before, since a G-Set is actually a very simple construct and Riak siblings are perfect to support the convergent properties of CRDTs, the implementation of a G-Set Term-Based Inverted Index is nearly trivial.
There’s a basic interface that belongs to a Grow Only Set in addition to some basic JSON serialization facilities (not shown):
Next there’s the actual implementation of the Inverted Index. The index put operation simply creates a serialized G-Set with the single index value into Riak, likely creating a sibling in the process.
The index get operation retrieves the index value. If there are siblings, it resolves them by merging the underlying G-Sets, as described above, and writes the resolved record back into Riak.
With the modified Ruby client, adding a Term-Based Inverted Index is just as easy as a Secondary Index. Instead of using
_bin to indicate a string index and we’ll use
_inv for our Term-Based Inverted Index.
Binary Secondary Index:
zombie.indexes['zip_bin'] << data['ZipCode']
Term-Based Inverted Index:
zombie.indexes['zip_inv'] << data['ZipCode']
The downsides of G-Set Term-Based Inverted Indexes versus Secondary Indexes
- There is no way to remove keys from an index
- Storing a key/value pair with a Riak Secondary index takes about half the time as putting an object with a G-Set Term-Based Inverted Index because the G-Set index involves an additional Riak put operation for each index being added
- The Riak object which the index refers to has no knowledge of which indexes have been applied to it
- It is possible; however, to update the metadata for the Riak object when adding its key to the G-Set
- There is no option for searching on a range of values (e.g., all
See the Secondary Index documentation for more details.
The downsides of G-Set Term-Based Inverted Indexes versus Riak Search:
Riak Search is an alternative mechanism for searching for content when you don’t know which keys you want.
- No advanced searching: wildcards, boolean queries, range queries, grouping, etc
See the Riak Search documentation for more details.
Let’s see some graphs.
The graph below shows the average time to put an object with a single index and to retrieve a random index from the body of indexes that have already been written. The times include the client-side merging of index object siblings. It’s clear that although the put times for an object + G-Set Term-Based Inverted Index are roughly double than that of an object with a Secondary Index, the index retrieval times are less than half. This suggests that secondary indexes would be better for write-heavy loads but the G-Set Term-Based Inverted Indexes are much better where the ratio of reads is greater than the number of writes.
Over the length of the test, it is even clearer that G-Set Term-Based Inverted Indexes offer higher performance than Secondary Indexes when the workload of Riak skews toward reads. The use of G-Set Term-Based Inverted Indexes is very compelling even when you consider that the index merging is happening on the client-side and could be moved to the server for greater performance.
- Implement other CRDT Sets that support deletion
- Implement G-Set Term-Based Indexes as a Riak Core application so merges can run alongside the Riak cluster
- Implement strategies for handling large indexes such as term partitioning