March 17, 2015
As a Solutions Architect for Basho, I’m often called upon by customers to explain Riak. Frequently this is in the context of a specific use-case they are deploying, or a problem they are attempting to solve. In each of these cases, an important component of the conversation is ensuring a base level of understanding of the design principles in Riak’s distributed architecture. It is this architecture, and these early design decisions that ensure that Riak “just works” and why companies choose Riak for their business critical applications.
This blog highlights key reasons why Riak “just works” from a high availability, fault tolerant, data distribution, predictable performance and operational simplicity perspective. It also shows how Riak addresses data replication, detection of node failures, read latency, and rolling restarts due to upgrades, failures or operating system related issues.
Uniform Data Distribution and Predictable Performance
If you have ever seen a presentation about Riak, you have seen an inclusion of the “ring diagram.” The image is fairly simple and one that we use to describe data distribution, scalability, and performance characteristics of Riak. But the underlying architectural implementation that this ring represents is an important characteristic of Riak’s architecture.
Riak employs a SHA1 based consistent hashing algorithm that is mathematically proven to produce a perfectly uniform distribution about a 160-bit space (the ring). This 160-bit space is divided into equal partitions called “vnodes.” These vnodes, in turn, are evenly distributed amongst participating physical nodes in the cluster. Uniform distribution about the 160-bit space and an even allocation of vnodes amongst physical nodes ensures keys are uniformly distributed about the cluster. Participating nodes in a Riak cluster are homogeneous – meaning any node can service any request – and due to the nature of consistent hashing, every node in the cluster knows where data should reside within the cluster.
Riak’s default behavior is to seek equilibrium. Each node in a cluster is responsible for 1/nth data within the cluster and 1/nth total performance, where n represents total node count in a cluster. This architectural principle allows operators to make reasonable assumptions about Riak’s linear scalability. Often in sharded systems, access patterns that disproportionately access specific ranges of data will cause “hot spots” in a cluster making predictable operations more difficult to maintain. Uniform data distribution about the cluster (and Hinted Handoff described below) allows for continuous normal operations of an active Riak cluster while maintaining a predictable level of performance as any node is removed from the cluster for any reason. Even in resource-starved virtual environments, a Riak cluster will work to maintain its equilibrium and equal data distribution such that operators may assemble larger numbers of slower individual virtual machines to achieve their desired performance profile in aggregate. Additionally, each physical node in a Riak cluster maintains its own performance statistics that are easily accessible and parsable such that an advanced deployment would be able to wrap those statistics into a sick-node detection algorithm based on their own specific thresholds.
Furthering the predictable performance rationale, Riak uses vector clocks, specifically dotted version vectors in Riak 2.0, to internally reason about conflict resolution as it relates to multiple updates to the same object. Updates to any object are independent of updates to any other object and thus there is no locking in Riak for any of its core operations – reads, writes and deletes. There are no global or local locks on any tables or rows – those structures do not exist in Riak. Locks introduce nondeterministic delays in performance yet are often a necessary component of any absolutely consistent database. Riak on the other hand is an eventually consistent database (caveat Riak 2.0’s strong consistency feature at the expense of availability). Additionally, because the base level of abstraction in Riak is the vnode, and data is replicated amongst a set of vnodes operating on distinct physical hardware, background processes such as file compaction is scheduled in such a way that it only affects a segment of the cluster at any given time. This rolling compaction ensures a high degree of availability with minimal performance penalty.
High Availability and Failure Recovery
The measure of a distributed system is not how well that system runs under optimal conditions in the general case, but rather how that system performs in the face of failure. An architect must ask herself how well her system will perform in the face of node failure and how well that system will recover from failure. Failures happen with increasing frequency as the size of your system grows. Riak implements a number of technologies that when combined ensure Riak excels at failure recovery. These technologies provide a baseline set of features that allow Riak to quickly recover from failure scenarios with minimal operator intervention. Not only will these built-in recovery mechanisms maintain eventual consistency within the database but they will also maintain synchronicity with features such as Solr’s full text indexing and Multi Datacenter Replication.
Hinted Handoff ensures data is replicated an appropriate number of times in spite of failure by allowing a node to take over responsibility for a vnode and then return that data to its original “owner.” Handoff from one vnode to another can happen on a temporary (in the case of failure) or permanent (in the case of cluster resizing) basis. In both cases, Riak handles this automatically while the cluster remains available. Because Riak is able to dynamically allocate vnode assignments continuously, the cluster can absorb the loss of any physical node(s) for write operations and ensure availability of data as long as one vnode of any replica set is still accessible. This allows Riak to maintain availability when a node is removed from the cluster for any reason – whether scheduled or not. Ultimately, an unscheduled failure or a scheduled upgrade of Riak or the operating system results in the removal of a node from the cluster – Riak’s core architecture and capabilities accommodate this behavior with minimal operator intervention.
Read Repair is a mechanism triggered on a successful read of a value where all replicas may not agree on the value. There are two possibilities where this can happen: when one replica returns not found – meaning it doesn’t have a copy, and when one replica returns a value where the vector clocks is an ancestor of the successful read. When this situation occurs, Riak will force the errant nodes to update the object’s value based on the value of the successful read.
Finally, Active Anti Entropy corrects inconsistent data continuously in the background. Where Read Repair corrects data at read time, AAE corrects all data regardless of whether or not it is actively accessed by running a background process that continuously checks for inconsistencies. AAE uses Merkle Tree hash exchanges between vnodes to look for these inconsistencies. When a difference at the top of the tree is detected, Riak recursively checks the tree until it finds the exact values with a difference between vnodes and then sends the smallest amount of data necessary to regain equilibrium.
Architecture Makes a Difference
Riak’s core architecture and key technical features provide the building blocks valued in a highly distributed environment. Uniform data distribution, homogeneous nodes, hinted handoff, read repair and active anti entropy all play a role in providing the high availability, fault tolerance, predictable performance and operational simplicity developers and managers are looking for from their non-relational persisted data solutions.
We’ve seen Riak adopted across a wide variety of verticals and for a broad range of use cases. From gaming to retail to advertising to mobile, our customers begin by identifying a workload, or use case, where availability, scalability, and fault tolerance are critical. We then work closely with these customers to ensure Riak is an ideal fit for the architectural design and business requirements. This process often begins with a Tech Talk. Someone like me, working with you either onsite or remotely, to assess how Riak can help you solve your critical business requirements. You can sign up for a Tech Talk here.
February 17, 2015
According to TechTarget, a common definition of “High Availability” is:
“In information technology, high availability refers to a system or component that is continuously operational for a desirably long length of time. Availability can be measured relative to “100% operational” or “never failing.”
The reality is that this phrase has become semantically overloaded by its inclusion in marketing copy across a disparate set of technologies. Much like “Big Data”, perspectives on availability vary based on industry and customer expectation.
For many of today’s applications and platforms, high availability has a direct impact on revenue. A few examples include: cloud services, online retail, shopping carts, gaming and betting, and advertising. Further, lack of availability can damage user trust and result in a poor user experience for many social media and chat applications, websites, and mobile applications. Riak provides the high availability needed for your critical applications.
Availability – By the Numbers
As we highlighted in an infographic entitled Down with Downtime, more than 95% of businesses with 1,000+ employees estimate that they lose more than $100,000 for every 1 hour of downtime. For more than 1 in 2 large businesses, the cost of downtime amounts to more than $300,000 per hour. At the lower end of this scale, this is $83 dollars per minute. At the upper end of the spectrum (in financial services) it can amount to $1,800 a second of downtime.
This fiscal impact has resulted in availability being measured as a percentage calculation of uptime in a given year. This percentage is often referred to as the “number of 9s” of availability. For example, “one nine” of availability equates to 90% uptime in a year. Similarly, “five nines” (the standard that was set by consulting firms on enterprise projects) equates to 99.999% availability in a year. While that percentage is often referenced, the practical reality is that it means there can be no more than 6.05 seconds of unplanned downtime per week.
Availability – A Feature or A Benefit?
Often, when describing Riak, I begin by explaining the benefits of Riak (availability, scalability, fault tolerance, operational simplicity) and then discuss, in detail, the properties that these benefits are derived from. Availability is not something that can be added to a system (be it a distributed database or otherwise), rather it is an outcome of the core architectural decisions that were made in the development of the product.
Consider, for example, the AXD 301 ATM switch. It, reportedly, delivers at or better than “nine nines” (99.9999999%) of availability to customers. This is a staggering number that requires NO MORE than 6.048 milliseconds of downtime per week. Interestingly, it shares a common architectural component with Riak also being developed in Erlang.
“How does Riak achieve high availability?” Or, perhaps better stated as, “What are the architectural decisions made in Riak that enable high availability?”
Availability – An Architectural Decision
Riak is a masterless system designed for high availability, even in the event of hardware failures or network partitions. Any server (termed a “node” in Riak) can serve any incoming request and all data is replicated across multiple nodes. If a node experiences an outage, other nodes will continue to service read and write requests. Further, if a node becomes unavailable to the rest of the cluster, a neighboring node will take over the responsibilities of the missing node. The neighboring node will pass new or updated data (termed “objects”) back to the original node once it rejoins the cluster. This process is called “hinted handoff” and it ensures that read and write availability is maintained automatically to minimize your operational burden when nodes fail or comes back on-line.
More information about the architectural decisions involved in Riak’s design are available in our documentation. In particular, the Concepts – Clusters section is deeply illustrative.
Availability – A Use Case
Consider, for example the implementation of Riak at Temetra. Temetra has thousands of users and millions of meters that create billions of data points. The massive influx of data that was being generated quickly became difficult to manage with the company’s legacy SQL database. When considering how this structured database could be overhauled, Temetra conducted evaluations with Cassandra and Hadoop but ultimately chose Riak due to its high availability, relatively self-maintaining and easy to deploy infrastructure. It is essential that the data collected from the meters is always available as it is relied on to determine correct billing for Temetra’s customers.
Availability – A Summary
The reality is that a database, even a distributed, masterless, multi-model platform like Riak, is only one component of the application stack. Understanding your availability requirements requires deep knowledge of the entirety of the deployment environment. “High Availability” cannot be retrofit into a system. Rather it requires conscious effort in the early stages to ensure that customer requirements are met and that downtime does not result in lost customers and lost revenue.
January 7, 2014
Writing an application that can take full advantage of Riak’s robust scaling properties requires a different way of looking at data storage and retrieval. Developers who bring a relational mindset to Riak may create applications that work well with a small data set but start to show strain in production, particularly as the cluster grows.
Thus, this looks at some of the common conceptual challenges.
Riak offers query features such as secondary indexes (2i), MapReduce, and full-text search, but throwing a large quantity of data into Riak and expecting those tools to find whatever you need is setting yourself (and Riak) up to fail. Performance will be poor, especially as you scale.
Reads and writes in Riak should be as fast with ten billion values in storage as with ten thousand.
Key/value operations seem primitive (and they are) but you’ll find they are flexible, scalable, and very fast (and predictably so).
Treat 2i and friends as tools to be applied judiciously, design the main functionality of your application as if they don’t exist, and your software will continue to work at blazing speeds when you have petabytes of data stored across dozens of servers.
Normalizing data is generally a useful approach in a relational database, but unlikely to lead to happy results with Riak.
Riak lacks foreign key constraints and join operations, two vital parts of the normalization story, so reconstructing a single record from multiple objects would involve multiple read requests; certainly possible and fast enough on a small scale, but not ideal for larger requests.
Instead, imagine the performance of your application if most of your requests were a single, trivial read. Preparing and storing the answers to queries you’re going to ask later is a best practice for Riak.
Ducking Conflict Resolution
One of the first hurdles Basho faced when releasing Riak was educating developers on the complexities of eventual consistency and the need to intelligently resolve data conflicts.
Because Riak is optimized for high availability, even when servers are offline or disconnected from the cluster due to network failures, it is not uncommon for two servers to have different versions of a piece of data.
The simplest approach to coping with this is to allow Riak to choose a winner based on timestamps. It can do this more effectively if developers follow Basho’s guidance on sending updates with vector clock metadata to help track causal history, but often concurrent updates cannot be automatically resolved via vector clocks, and trusting server clocks to determine which write was the last to arrive is a terrible conflict resolution method.
Even if your server clocks are magically always in sync, are your business needs well-served by blindly applying the most recent update? Some databases have no alternative but to handle it that way, but we think you deserve better.
Riak 2.0, when installed on new clusters, will default to retaining conflicts and requiring the application to resolve them, but we’re also providing replicated data types to automate conflict resolution on the servers.
If you want to minimize the need for conflict resolution, modeling with as much immutable data as possible is a big win.
For years, functional programmers have been singing the praises of immutable data, and it confers significant advantages when using a distributed data store like Riak.
Most obviously, conflict resolution is dramatically simplified when objects are never updated.
Even in the world of single-server database servers, updating records in place carries costs. Most databases lose all sense of history when data is updated, and it’s entirely possible for two different clients to overwrite the same field in rapid succession leading to unexpected results.
Some data is always going to be mutable, but thinking about the alternative can lead to better design.
SELECT * FROM <table>
A perfectly natural response when first encountering a populated database is to see what’s in it. In a relational database, you can easily retrieve a list of tables and start browsing their records.
As it turns out, this is a terrible idea in Riak.
Riak is optimized for unstructured, opaque data; however, it is not designed to allow for trivial retrieval of lists of buckets (very loosely analogous to tables) and keys.
Doing so can put a great deal of stress on a large cluster and can significantly impact performance.
It’s a rather unusual idea for someone coming from a relational mindset, but being able to algorithmically determine the key that you need for the data you want to retrieve is a major part of the Riak application story.
Because Riak sends multiple copies of your data around the network for every request, values that are too large can clog the pipes, so to speak, causing significant latency problems.
Basho generally recommends 1-4MB objects as a soft cap; larger sizes are possible with careful tuning, however.
For significantly larger objects, Riak CS offers an Amazon S3-compatible (and also OpenStack Swift-compatible) key/value object store that uses Riak under the hood.
Running a Single Server
This is more of an operations anti-pattern, but it is a common misunderstanding of Riak’s architecture.
It is quite common to install Riak in a development environment using its
devrel build target, which creates five full Riak stacks (including Erlang virtual machines) to run on one server to simulate a cluster.
However, running Riak on a single server for benchmarking or production use is counterproductive, regardless of whether you have one stack or five on the box.
It is possible to argue that Riak is more of a database coordination platform than a database itself. It uses Bitcask or LevelDB to persist data to disk, but more importantly, it commonly uses at least 64 such embedded databases in a cluster.
Needless to say, if you run 64 databases simultaneously on a single filesystem you are risking significant I/O and CPU contention unless the environment is carefully tuned (and has some pretty fast disks).
Perhaps more importantly, Riak’s core design goal, its raison d’être, is high availability via data redundancy and related mechanisms. Writing three copies of all your data to a single server is mostly pointless, both contributing to resource contention and throwing away Riak’s ability to survive server failure.
So, Now What?
As always, we recommend visiting Basho’s docs website for more details on how to build and run Riak, and many of our customers have given presentations on their use cases of Riak, including data modeling.
Also, keep an eye on the Basho blog where we provide high-level overviews like this of Riak and the larger non-relational database world.
For a detailed analysis of your needs and modeling options, contact Basho regarding our professional services team.
- Why Riak (docs.basho.com)
- Data Modeling (docs.basho.com)
- Clocks Are Bad, Or, Welcome to the Wonderful World of Distributed Systems (Basho blog)
- A Little Riak Book
January 6, 2014
With the launch of the Technical Preview of Riak 2.0, we also announced the addition of strong consistency to Riak. This addition fundamentally changes how Riak can be used, since all previous versions classified Riak as an eventually consistent system.
With Riak 2.0, developers now have the flexibility to choose whether buckets should be highly available or strongly consistent, based on data requirements. Consistency preferences are defined on a per bucket type basis, in the same cluster.
At RICON West 2013, Basho senior engineer, Joseph Blomstedt, gave an updated version of his “Bringing Consistency to Riak” talk. The original talk (presented at RICON West 2012) discussed the challenges, motivations, and high-level plans of bringing consistency to Riak. This updated version presents the actual implementation that has since been built and how it will function in Riak 2.0. Both talks are available below.
To start testing the strong consistency feature, you can download the Technical Preview of Riak 2.0 here.
To watch all of the sessions from RICON West 2013, visit the Basho Technologies Youtube Channel.
November 18, 2013
This series of blog posts will discuss how Riak differs from traditional relational databases. For more information about any of the points discussed, download our technical overview, “From Relational to Riak.” The previous post in the series discussed High Availability and Cost of Scale.
In order to provide high availability, which is a cornerstone of Riak’s value proposition, the database stores several copies of each key/value pair.
This availability requirement leads to a fundamental tradeoff: in order to continue to serve requests in the presence of failure, we do not force all data in the cluster to stay in sync. Riak will allow writes and reads no matter how many servers (and their stored replicas) are offline or otherwise unreachable.
(Incidentally, this lack of strong coordination has another consequence beyond high availability: Riak is a very, very fast database.)
Riak does provide both active and passive self-healing mechanisms to minimize the window of time during which two servers may have different versions of data.
The concept of eventual consistency may seem unfamiliar, but if you’ve ever implemented a cache or used DNS, those are common examples of the idea. In a large enough system, it’s effectively the default state of all data.
However, with the forthcoming release of Riak 2.0, operators will be able to designate selected pieces of data to require coordination and maintain strong consistency over high availability. Writing such data will be slower and subject to failure if too many servers are unreachable, but the overall robust architecture of Riak will still provide a fast, highly available solution.
Riak stores data using a simple key/value model, which offers developers tremendous flexibility to define access models that suit their applications. It is also content-agnostic, so developers can store arbitrary data in any convenient format.
Instead of forcing application-specific data structures to be mapped into (and out of) a relational database, they can simply be serialized and dropped directly into Riak. For records that will be frequently updated, if some of the fields are immutable and some aren’t, we recommend keeping the immutable data in one key/value pair and the rest organized into a single or multiple objects based on update patterns.
Relational databases are ingrained habits for many of us, but moving beyond them can be liberating. Further information about data modeling, including sample configurations, are available on Use Cases section of the documentation.
One tradeoff with this simpler data model is that there is no SQL or SQL-like language with which to query the data.
To achieve optimal performance, it is advisable to take advantage of the flexibility of the key/value model to define simple retrieval patterns. In other words, determine the most useful queries and write the results of those queries as the data is being processed.
Because it is not always possible to know in advance what questions will need to be asked of your data, Riak offers added functionality on top of the key/value model. Tools such as Riak Search (a distributed, full-text search engine), Secondary Indexing (ability to tag objects with queryable metadata), and MapReduce (leveraged for aggregation tasks) are available to perform ad hoc queries as needed.
For many users, the tradeoffs of moving to Riak are worthwhile due to the overall benefits; however, it can be a bit of an adjustment. To see why others have chosen to switch to Riak from both relational systems and other NoSQL databases, check out our Users Page.
November 13, 2013
This series of blog posts will discuss how Riak differs from traditional relational databases. For more information about any of the points discussed, download our technical overview, “From Relational to Riak.”
One of the biggest differences between Riak and relational systems is our focus on availability. Riak is designed to be deployed to, and runs best on, multiple servers. It can continue to function normally in the presence of hardware and network failures. Relational databases, conversely, are simplest to set up on a single server.
Most relational databases offer a master/slave architecture for availability, in which only the master server is available for data updates. If the master fails, the slave is (hopefully) able to step in and take over.
However, even with this simple model, coping with failure (or even properly defining it) is non-trivial. What happens if the master and slave server cannot talk to each other? How do you recover from a split brain scenario, where both servers think they’re the master and accept updates? What happens if the slave is slow to respond to updates sent from the master database? Can clients read from a slave? If so, does the master need to verify that the slave has received all updates before it commits them locally and responds to the client that requested the updates?
Conversely, Riak is explicitly designed to expect server and network failure. Riak is a masterless system, meaning any server can respond to read or write requests. If one fails, others will continue to service client requests. Once this server becomes available again, the cluster will feed it any updates that it missed through a process we call hinted handoff.
Because Riak’s system allows for reads and writes when multiple servers are offline or otherwise unreachable, data may not always be consistent across the environment (usually only for a few milliseconds). However, through self-healing mechanisms like read repair and Active Anti-Entropy, all updates will propagate to all servers making data eventually consistent.
For many use cases, high availability is more important than strict consistency. Data unavailability can negatively impact revenue, damage user trust, lead to poor user experience, and cause lost critical data. Industries like gaming, mobile, retail, and advertising require always-on availability. Visit our Users Page to see how companies in various industries use Riak.
July 2, 2013
We use the Erlang/OTP programming language in building our products here at Basho. We made that choice consciously, believing that it would be a tradeoff – significant benefits balanced by a handful of costs. I am often asked if we would make the same choice all over again. To answer that question I need to address the tradeoff we thought we were making.
The single most compelling reason to choose Erlang was the attribute for which it is best known: extremely high availability. The original design goal for Erlang was to enable rapid development of highly robust concurrent systems that “run forever.” The poster child of its success (outside Riak, of course) is the AXD 301 ATM switch, which reportedly delivers at or better than “nine nines” (99.9999999%) of uptime to customers. Since when we set out to build a database for applications requiring extremely high availability, Erlang was a natural fit.
We knew that Erlang’s supervisor concept, enabling a “let it crash” program designed for resilience, would be a big help for making systems that handle unforeseen errors gracefully. We knew that lightweight processes and a “many-small-heaps” approach to garbage collection would make it easier to build systems not suffering from unpredictable pauses in production. Those features paid off exactly as expected, and helped us a great deal. Many other features that we didn’t understand the full importance of at the time (such as the ability to inspect and modify a live system at run-time with almost no planning or cost) have also helped us greatly in making systems that our users and customers trust with their most critical data.
It turns out that our assessment of the key trade-off — a more limited pool of talented engineers — is, in practice, not a problem for a company like Basho. We need to hire great software developers, and we tend to look for ones with particular skills in areas like databases and/or distributed systems. If someone is a skilled programmer in relatively arcane disciplines like those, then the ability to learn a new programming language will not be daunting. While it’s theoretically a nice bonus for someone to bring knowledge of all the tools we use, we’ve hired a significant number of engineers that had no prior Erlang experience and they’ve worked out well.
This same purported drawback is a benefit in some ways. By not just looking for “X Engineers” (where X is Java, Erlang, or anything else), we make a statement both about our own technology decision-making process and the expected levels of interesting work at Basho. To help me work on my house, I’d rather have someone who self-identifies as an “expert carpenter” or “expert plumber,” not “expert hammer wielder,” even in the cases where most of the job might involve that tool. We expect developers at Basho to exercise deep, broad interests and expertise, and for them to do highly creative work. When we mention Erlang and the other thoughtful decisions we made in building our products, they value the roadmap and leadership.
I had an entertaining and ironic conversation about this recently with a manager at a large database company. He explained to me that we had clearly made the wrong choice, and that we should have chosen Java (like his team) in order to expand the recruiting pool. Then, without breaking a stride, he asked if I could send any candidates his way, to fill his gaps in finding talented people.
We continue to grow and to bring on great new engineers.
That’s not to say that there are no downsides. Any language, runtime, and community will bring with it different constraints and freedoms, making some tasks easier and others less so. We’ve done some work over the years to participate in the highly supportive Erlang community. But the big organizational weakness that so many people thought would come with the choice? It’s simply not a problem.
That lesson, combined with the ongoing technical advantages we enjoy because of Erlang, makes it easy to answer the question:
Yes, we would absolutely choose Erlang today.