Get Riak– the world’s best open-source distributed database! Contact Us
| General | |
|---|---|
| Architecture | Masterless distributed database with an emphasis on predictability and availability. |
| Languages Used | Written in Erlang and C. |
| Clients | Basho supported Drivers are available for Erlang, Python, Java, PHP, Javascript, and Ruby. Take a more in depth look at the Client Libraries |
| History | Riak is based on technology originally developed by Basho Technologies to run a sales automation business. There was more interest in the database technology than the applications built on it so Basho decided to build a business around Riak itself. |
| Influences | Riak is heavily influenced by Dr. Eric Brewer's CAP Theorem and Amazon's Dynamo paper. Some of the original team comes from Akamai which also explains their focus on operational ease and fault tolerance. |
| Data Storage | |
| Buckets and Keys | Buckets and keys are the primary way to organize data inside of Riak. User data is stored and referenced by bucket/key pairs. |
| Links and Metadata | Riak objects can store links pointing to other entries. These links can be walked via Riaks HTTP interface directly or as part of a map/reduce job. Riak objects and buckets can store user-defined metadata in addition to the actual data payload. The metadata is exposed as HTTP headers in the HTTP interface and as associative arrays inside map/reduce jobs. |
| Local Disk Storage | Bitcask is the default backend for Riak. Bitcask is a simple yet powerful local key/value store that serves as Riaks low latency, high throughput storage backend. |
| Pluggable Backends | Riak uses an API to interact with its storage subsystem. The API allows Riak to support multiple backends which can be plugged in as needed. Riak currently ships with multiple backends including Bitcask, LevelDB and Embedded InnoDB. You can read more about configuring backends on the storage_backends section of the Configuration Files page. |
| Clustering | |
| Nodes, Vnodes, and Partitions | Central to any Riak cluster is a 160-bit integer space which is divided into equally-sized partitions. Physical servers, referred to in the cluster as nodes, run a certain number of virtual nodes, or vnodes. Each vnode will claim a partition on the ring. The number of active vnodes is determined by the number of physical nodes in the a cluster at any given time. As a rule, each node in the cluster is responsible for 1/(total number of physical nodes) of the ring. You can determine the number of vnodes on each node by calculating (number of partitions)/(number of nodes). More simply put, a ring with 32 partitions, composed of four physical nodes, will have approximately eight vnodes per node. This setup is represented in the diagram below. Nodes can be added and removed from the cluster dynamically and Riak will redistribute the data accordingly. |
| Distribution | Riak is designed, from the ground up, to run in a distributed environment. Core operations, such as read/writing data and executing map/reduce jobs, actually become faster when more Riak nodes are added to a cluster. |
| Riak & CAP Thereom |
One principle that influenced Riak’s design is Dr. Eric Brewers CAP Theorem. This principle describes three desired properties: Consistency, Availability, and Partition tolerance. The theorem states you can only guarantee having two of the three properties at any time. Riak chooses to emphasize on the A and P of CAP. This choice makes Riak “eventually consistent.” However this doesn’t mean “inconsistent” but rather that Riak will allows some subsystems to change before others during failure scenarios– instead of failing entirely. |
| Master-less Configuration | All nodes in a Riak cluster are equal, with no special roles. Each node contributes to the storage capacity and each node is fully capable of serving any client request. |
| Bucket Limitations |
Riak communicates bucket information around the cluster using a gossip protocol. In general, large numbers of buckets within a Riak cluster is not a problem. In practice, there are two potential restrictions on the maxmimum number of buckets a Riak cluster can handle. First, buckets which use a non-standard set of properties will force Riak to gossip more data around the cluster. The additional data can slow processing and place an upper limit on performance. Second, some backends, namely Innostore, store each bucket as a separate entity. This can cause a node to run out of resources such as file handles. These resource restrictions might not impact performance but they can represent another limit on the maximum number of buckets. |
| Replication |
Riak controls how many replicas of data are kept via a setting called the N value. This value has a default that can be overridden on each bucket. Riak objects inherit the N value of their parent bucket. For example, here is a depiction of what happens when n_val = 3 (This is the default setting). When you store a datum in a bucket with an N value of three, the datum will replicated to three separate partitions on the Riak Ring. Configuring Replication in Riak
|
| Hinted Handoff | Riak uses a technique called hinted handoff to compensate for failed nodes in a cluster. Neighbors of a failed node will pick up the slack and perform the work of the failed node allowing the cluster to continue processing as usual. This can be considered a form of self-healing. |
| Dynamic Growth | Riak will automatically re-balance data as nodes join and leave the cluster. |
| Data Reading | |
| Fetching | Riak objects can be fetched directly if the client knows the bucket and key. This is the fastest way to get data out of Riak. |
| R Value | Riak allows the client to supply an R value on each direct fetch. The R value represents the number of Riak nodes which must return results for a read before the read is considered successful. This allows Riak to provide read availability even when nodes are down or laggy. |
| Read Failure Tolerance | Subtracting R from N will tell you the number of down or laggy nodes a Riak cluster can tolerate before becoming unavailable for reads. For example, an 8 node cluster with an N of 8 and a R of 1 will be able to tolerate up to 7 nodes being down before becoming unavailable for reads. |
| Link Walking | Riak can also return objects based on links stored on the object. Link walking can be used to return a set of related objects from a single request. |
| Writing and Updating Data | |
| Vector clocks | Each update to a Riak object is tracked by a vector clock. Vector clocks allow Riak to determine causal ordering and detect conflicts in a distributed system. Vector Clocks In Depth |
| Conflict Resolution | Riak has two ways of resolving update conflicts on Riak objects. Riak can allow the last update to automatically win or Riak can return both versions of the object to the client. This gives the client the opportunity to resolve the conflict on its own. |
| W Value | Riak’s API allows the client to supply a W value on each update. The W value represents the number of Riak nodes which must report success before an update is considered complete. This allows Riak to provide write availability even when nodes are down or laggy. |
| Write failure tolerance | To ensure read-your-writes consistency, users should usually choose their values such as R + W > N. This will ensure that a successful write will be reflected in any subsequent read. |
| Querying and MapReduce | |
| Querying | Querying Riak may also be performed by map/reduce style. Riak has a flexible and targeted map/reduce mechanism which more closely resembles Hadoop than CouchDB, but is not quite the same as either. |
| Phases | Map/Reduce jobs are chains of functions. Riak combines the function along with information about what to do with the output and the language the function is written into a phase. The output of one phase becomes the input for the next. Riak can accumulate the output of any phase or set of phases and return the data to the client as part of that jobs output. |
| Map Phase | Map phases are responsible for gathering data from Riak bucket/key entries. Map phases can operate on entire buckets or lists of bucket/key pairs. The fetching of data is highly parallel and scales with the number of nodes in the cluster. Riak caches the results of map fetches to reduce the load on the storage backends. This means subsequent map fetches for the same data will be faster after the initial fetch. |
| Link Phase | Link phases are a specialized version of map phases. A link phase fetches Riak objects based on a link walk. Link phases can be used to perform map/reduce processing on sets of related objects. |
| Reduce Phase | Reduce phases perform arbitrary processing on the data retrieved from prior phases. Unlike Map phases, a Reduce phase will block until complete so it can perform aggregation functions over the entire set of incoming data. |
| Query Languages | Map/Reduce queries can be written in Erlang or Javascript. When using Erlang, Erlang functions have complete access to the Riak Erlang API. When using Javascript, Mozillas Spidermonkey engine provides the runtime environment. Pre-defined Javascript functions run almost as fast as Erlang functions. Javascript functions are currently permitted read-only access to Riak.Watch a screencast overview of MapReduce in Riak |
| The Riak API | |
| Data Storage |
The guys who wrote Riak are also responsible for the Erlang REST framework Webmachine, so its not surprising Riak uses REST for its API. Storage operations use HTTP PUTs or POSTs and fetches use HTTP GETs. Storage operations are submitted to a pre-defined URL which defaults to /riak. Clients can set the Content-Type header for each Riak object. Riak will replay the header when the object is fetched. This is nice since it allows Riak to be content agnostic. Take an in depth look at Riak’s HTTP API |



