November 14, 2013
This series of blog posts will discuss how Riak differs from traditional relational databases. For more information about any of the points discussed, download our technical overview, “From Relational to Riak.” The previous post in the series was Relational to Riak – High Availability.
Riak is designed for scalability, which truly separates it from relational systems. As described in the previous post, relational databases run best on a single server. If the dataset grows beyond the capacity of this single machine, it can become prohibitively expensive (or even impossible) to simply upgrade to a bigger machine. In such a scenario, the only option may be to add more machines and divide the dataset across them using a technique called sharding.
Sharding divides data into logical parts (for example alphabetically, by customer, or by geographic region) that can be distributed across multiple machines, often manually. If data continues to grow, this process may need to be repeated at great expense.
Sharding is not only difficult; it also typically leads to hot spots, where certain machines store and serve a disproportionately large share of both data and requests. Hot spots can cause unpredictable latency and degraded performance.
(And remember all the ways in which availability is a challenge? Combine sharding with a master/slave architecture for maximal expense and general unpleasantness.)
Instead of sharding, Riak distributes data evenly across a cluster using consistent hashing. In a Riak cluster, the data space (usually pictured as a ring) is divided into partitions, which are claimed by the servers. When new objects are written to the database, they are placed evenly around the ring and replicated 3 times (by default). This replication keeps your data available even when individual nodes fail.
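To make the replication step concrete, here is a minimal sketch of how three copies of an object might land on different servers. It assumes a ring of 64 partitions, that copies go to consecutive partitions on the ring, and that partitions are claimed round-robin by the servers. The names `replica_partitions` and `owners_of` are hypothetical and are not part of any Riak API.

```python
NUM_PARTITIONS = 64  # assumed ring size for this sketch
N_VAL = 3            # copies per object, the default mentioned above

def replica_partitions(primary: int, n_val: int = N_VAL) -> list[int]:
    """Return the primary partition plus the following ones, wrapping around the ring."""
    return [(primary + i) % NUM_PARTITIONS for i in range(n_val)]

def owners_of(partitions: list[int], nodes: list[str]) -> list[str]:
    """Map each partition to a server, assuming simple round-robin ownership."""
    return [nodes[p % len(nodes)] for p in partitions]

nodes = ["node1", "node2", "node3", "node4"]
replicas = replica_partitions(63)
print(replicas)                    # [63, 0, 1]
print(owners_of(replicas, nodes))  # ['node4', 'node1', 'node2']
```

Because neighboring partitions belong to different servers in this layout, losing any one server still leaves two copies of the object reachable.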
When nodes are added or removed, data is rebalanced automatically. New machines assume ownership of some of the partitions and existing machines hand off relevant partitions and associated data until data ownership is equal amongst nodes.
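The handoff described above can be sketched roughly as follows. This is a simplified greedy illustration, not Riak's actual claim algorithm: the function `add_node`, the 64-partition ring, and the four starting nodes are all assumptions made for the example.

```python
from collections import Counter

def add_node(owners: dict[int, str], new_node: str) -> int:
    """Hand partitions to new_node (in place) until it holds its fair share.

    Returns the number of partitions that changed owner.
    """
    counts = Counter(owners.values())
    counts[new_node] = 0
    target = len(owners) // len(counts)  # fair share, rounded down
    moved = 0
    while counts[new_node] < target:
        # Take one partition from whichever existing node currently owns the most.
        donor = max((n for n in counts if n != new_node), key=lambda n: counts[n])
        partition = next(p for p, n in owners.items() if n == donor)
        owners[partition] = new_node
        counts[donor] -= 1
        counts[new_node] += 1
        moved += 1
    return moved

# Four nodes share 64 partitions evenly (16 each) before the change.
owners = {p: f"node{p % 4 + 1}" for p in range(64)}
moved = add_node(owners, "node5")
print(moved)                     # 12
print(Counter(owners.values()))  # node1..node4 own 13 each, node5 owns 12
```

In this sketch only 12 of the 64 partitions change hands; the other 52 stay where they are, which is the point of rebalancing by partition rather than re-splitting the whole dataset.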
By eliminating the manual requirements of sharding and making hot spots highly unlikely, Riak makes it significantly easier for companies to scale, whether it’s just for a few months to handle peak loads or to support long-term growth strategies.
January 14, 2013
This is the second in a series of blog posts that discusses a high-level overview of the benefits and tradeoffs of Riak versus traditional relational databases. If this is relevant to your projects or applications, register for our “From Relational to Riak” webcast on Thursday, January 24.
One critical factor in deciding which database to use is its operational profile. Many customers today are dealing with rapid data growth, intense peak loads and the imperative to maintain economies of scale across a large platform. For these customers, how the database scales up and what impact that has on operations is a huge factor in business and technical decisions around what technology to use.
The cost of scale is one reason why many of our users and customers have picked Riak over a traditional relational system. From experience, users have discovered that scaling a relational system can be expensive and error-prone, and can lead to significant and disruptive operations projects. In this post, we’ll take a look at how a relational database’s sharding approach differs from Riak’s consistent hashing approach and what that means for you as an operator.
Historically, relational databases were commonly found running in production on a single server. If capacity and availability needs require more than a single machine, relational databases address scale using a technique called sharding. Sharding breaks data into logical parts (such as alphabetically, numerically or by geographic region) that can be distributed across multiple machines. A simplified example is below.
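As a rough illustration of alphabetical sharding, the sketch below routes customer records to one of three machines by the first letter of the name. The shard map, machine names, and customer names are all made up for the example.

```python
from collections import Counter

# Hypothetical shard map: alphabetical ranges assigned to three machines.
SHARDS = {
    "db1": ("a", "h"),
    "db2": ("i", "q"),
    "db3": ("r", "z"),
}

def shard_for(customer_name: str) -> str:
    """Return the machine whose letter range covers the name's first letter."""
    first = customer_name[0].lower()
    for machine, (low, high) in SHARDS.items():
        if low <= first <= high:
            return machine
    raise ValueError(f"no shard covers {customer_name!r}")

# Made-up sample data; real names are rarely spread evenly over the alphabet.
customers = ["Adams", "Baker", "Brown", "Clark", "Davis",
             "Evans", "Garcia", "Harris", "Jones", "Smith"]
load = Counter(shard_for(name) for name in customers)
print(load)  # Counter({'db1': 8, 'db2': 1, 'db3': 1})
```

Even though each machine covers a similar slice of the alphabet, eight of the ten sample records land on `db1`, because the routing rule follows the data's natural skew.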
This approach can be problematic for several reasons. First, writing and maintaining sharding logic increases the overhead of operating and developing an application on the database. Significant growth of data or traffic typically means significant, often manual, resharding projects. Determining how to intelligently split the dataset without negatively impacting performance, operations, and development presents a substantial challenge, especially when dealing with “big data”, rapid scale, or peak loads.
Further, rapidly growing applications frequently outpace an existing sharding scheme. When the data in a shard grows too large, the shard must again be split. While several “auto”-sharding technologies have emerged in recent years, these methods are often imprecise and manual intervention is standard practice.
Finally, sharding can often lead to “hot spots” in the database (physical machines responsible for storing and serving a disproportionately high amount of both data and requests), which can lead to unpredictable latency and degraded performance.
To avoid sharding (and the associated expenses), data in Riak is distributed across nodes using consistent hashing. Consistent hashing ensures data is evenly distributed around the cluster and new nodes can be added with automatic, minimal reshuffling of data. This significantly decreases risky “hot spots” in the database and lowers the operational burden of scaling.
How does consistent hashing work? Riak stores data using a simple key/value scheme. These keys and values are stored in a namespace called a bucket. When you add new key/value pairs to a bucket in Riak, each object’s bucket and key combination is hashed. The resulting value maps onto a 160-bit integer space. You can think of this integer space as a ring used to figure out what data to put on which physical machines.
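A minimal sketch of that hashing step is shown below. SHA-1 is used because it produces exactly 160 bits; joining the bucket and key with a `/` is an assumption made for illustration and does not reproduce Riak's internal encoding of the pair. The function name `ring_position` is hypothetical.

```python
import hashlib

RING_SIZE = 2 ** 160  # size of the 160-bit integer space

def ring_position(bucket: str, key: str) -> int:
    """Hash a bucket/key pair and read the 160-bit digest as an integer."""
    # Assumption: "bucket/key" as UTF-8 bytes stands in for Riak's own encoding.
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

pos = ring_position("users", "alice")
print(0 <= pos < RING_SIZE)  # True
```

The same bucket and key always hash to the same position, so any machine can work out where an object belongs without consulting a central lookup table.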
How? Riak divides the integer space into equally sized partitions (the default is 64). Each partition owns a given range of values on the ring and is responsible for all buckets and keys that, when hashed, fall into that range. Each partition is managed by a process called a virtual node (or “vnode”). Physical machines in the cluster evenly divide responsibility for vnodes. Each physical machine thus becomes responsible for all keys represented by its vnodes.
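Put together, the path from a key to a physical machine might look like the sketch below. It follows the description above (the partition whose range contains the hash is responsible for the key) and assumes round-robin assignment of partitions to machines, which is a simplification. `partition_for` and `claim_round_robin` are hypothetical names, and the `bucket/key` encoding is again an assumption.

```python
import hashlib
from collections import Counter

RING_SIZE = 2 ** 160
NUM_PARTITIONS = 64                          # default named in the text
PARTITION_SIZE = RING_SIZE // NUM_PARTITIONS

def partition_for(bucket: str, key: str) -> int:
    """Return the index (0..63) of the partition whose range contains the hash."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big") // PARTITION_SIZE

def claim_round_robin(nodes: list[str]) -> dict[int, str]:
    """Assign each partition (and its vnode) to a machine in round-robin order."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

nodes = ["node1", "node2", "node3", "node4"]
owners = claim_round_robin(nodes)
p = partition_for("users", "alice")
print(p, owners[p])              # partition index and the machine that owns it
print(Counter(owners.values()))  # each of the four machines owns 16 partitions
```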
When nodes are added or removed, data is rebalanced automatically without any operator intervention. New machines assume ownership of some of the partitions and existing machines hand off relevant partitions and associated data until data ownership is equal amongst nodes. Riak also has an elegant approach to making cluster changes such as adding or removing nodes, allowing you to stage up the changes, view the impact on the cluster, and then choose to commit or abort the changes. Developers and operators don’t have to deal with the underlying complexity of what data lives where as all nodes can serve and route requests. By eliminating the manual requirements of sharding and much of the potential for “hot spots,” Riak provides a much simpler operational scenario for many users that lets them add and remove machines as needed, no matter how much they grow.
October 1, 2012
New to Riak? Join us this Thursday for an intro to Riak webcast with Mark Phillips, Basho director of community, and Shanley Kane, director of product management. In this 30 minute talk we’ll cover the basics of:
- Good and bad use cases for Riak
- Some user stories of note
- Riak’s architecture: consistent hashing, hinted handoff, replication, gossip protocol and more
- APIs and client libraries
- Features for searching and aggregating data: Riak Search, Secondary Indexes and Map Reduce
- What’s new in the latest version of Riak
Register for the webcast here.
**October 22, 2012**
If you missed our last ‘Intro to Riak’ webcast, not to fear, we’re doing another. This Thursday (11am PT / 2pm ET), join Shanley, Basho’s director of product marketing, and Mark, director of community, for a 30-minute webcast introducing Riak’s architecture, use cases, user stories, operations and data model.
Register for the webcast [here](http://info.basho.com/IntroToRiakOct25.html).
**January 10, 2013**
Join us on Thursday, January 17 at 11 am PT / 2 pm ET for an intro to Riak webcast with Shanley Kane, director of product management, and Mark Phillips, director of community management.
In 30 minutes, we’ll cover:
* Good and bad use cases for Riak
* User stories of note, including social, content, advertising and session storage
* Riak’s architecture and mechanisms for remaining highly available in failure conditions
* APIs, data model and client libraries
* Features for querying and searching data
* What’s new in the latest version of Riak and what’s next
[Sign up for the webcast here](http://info.basho.com/IntroToRiakJan17.html).
July 28, 2010
Sometimes you need more than words to illustrate a point. Here is Basho’s humble attempt to clarify the difference between “Dynamo-Style” systems (like Riak), which use consistent hashing to achieve fault tolerance and simple scaling and to prevent data loss, and systems that use techniques like sharding.