July 1, 2011

For most Riak users, Bitcask is the obvious right storage engine to use. It provides low latency, solid predictability, is robust in the face of crashes, and is friendly from a filesystem backup point of view. However, it has one notable limitation: total RAM use depends linearly (though via a small constant) on the total number of objects stored. For this reason, Riak users that need to store billions of entries per machine sometimes use Innostore, (our wrapper around embedded InnoDB) as their storage engine instead. InnoDB is a robust and well-known storage engine, and uses a more traditional design than Bitcask which allows it to tolerate a higher maximum number of items stored on a given host.

However, there are a number of reasons that people may wish for something other than Innostore when they find that they are in this situation. It is less comfortable to back up than bitcask, imposes a higher minimum overhead on disk space, only performs well when both heavily tuned (and given multiple spindles), and comes with a more restrictive license. For all of these reasons we have been paying close attention to LevelDB, which was recently released by Google. LevelDB’s storage architecture is more like BigTable’s memtable/sstable model than it is like either Bitcask or InnoDB. This design and implementation brings the possibility of a storage engine without Bitcask’s RAM limitation and also without any of the above drawbacks of InnoDB. Our early hypothesis after reading the text and code was that LevelDB might fill an InnoDB-like role for Riak users, without some of the downsides. As some of the early bugs in LevelDB were fixed and stability improved, our hopes rose further.

In order to begin testing this possibility, we have begun to perform some simple performance comparisons between LevelDB and InnoDB using basho_bench and a few different usage patterns. All of these comparisons were performed on the exact same machine, a fairly basic 2-CPU Linux server with 4G of RAM, mid-range SATA disks, and so on — a fairly typical commodity system. Note that this set of tests are not intended to provide useful absolute numbers for either database, but rather to allow some preliminary comparisons between the two. We tried to be as fair as possible. For instance, InnoDB was given an independent disk for its journaling.

The first comparison was a sequential load into an empty database. We inserted one hundred million items with numerically-sorted keys, using fairly small values of 100 bytes per item.

The database created by this insert test was used as the starting point for all subsequent tests. Each subsequent test was run in steady-state for one hour on each of the two databases. Longer runs will be important for us to gauge stability, but an hour per test seemed like a useful starting point.

For the second comparison, we did a read-only scenario with a pareto distribution. This means that a minority of the items in the database would see the vast majority of requests, which means that there will be relatively high churn but also a higher percentage of cache hits in a typical system.

The third comparison used exactly the same pareto pattern for key distribution, but instead of being pure reads it was a 90/10 read/write ratio.

The fourth comparison was intended to see how the two systems compared to each other in a very-high-churn setting. It used the same dataset, but write-only, and in an extremely narrow pareto distribution such that nearly all writes would be within a narrow set of keys, causing a relatively small number of items to be overwritten many times.

In each of these tests, LevelDB showed a higher throughput than InnoDB and a similar or lower latency than InnoDB. Our goal in this initial round was to explore basic feasibility, and that has now been established.

This exercise has not been an attempt to provide comprehensive general-purpose benchmarks for either of these two storage engines. A number of choices made do not represent any particular generic usage pattern but were instead made to quickly put the systems under stress and to minimize the number of variables being considered. There are certainly many scenarios where either of these two storage systems can certainly be made to perform differently (sometimes much better) than they did here. In some earlier tests, we saw InnoDB provide a narrower variance of latency (such as lower values in the 99th percentile) but we have not seen that reproduced in this set of tests. Among the other things not done in this quick set of tests: using the storage engines through Riak, deeply examining their I/O behavior, observing their stability over very long periods of time, comparing their response to different concurrency patterns, or comparing them to a wider range of embedded storage engines. All of these directions (and more) are good ideas for continued work in the future, and we will certainly do some of that.

Despite everything we haven’t yet done, this early work has validated one early hope and hypothesis. It appears that LevelDB may become a preferred choice for Riak users whose data set has massive numbers of keys and therefore is a poor match with Bitcask’s model. Performance aside, it compares favorably to InnoDB on other issues such as permissive license and operational usability. We are now going ahead with the work and continued testing needed to keep exploring this hypothesis and to improve both Riak and LevelDB in order to make their combined use an option for our customers and open source community.

Some issues still remain that we are working to resolve before LevelDB can be a first-class storage engine under Riak. One such issue that we are working on (with the LevelDB maintainers at Google) is making the LevelDB code portable to all of the same platforms that Riak is supported on. We are confident that these issues will be resolved in the very near future. Accordingly, we are moving ahead with a backend enabling Riak’s use of LevelDB.