April 23, 2014
Traditional database architectures were the default option for many pre-Internet use cases and architectures, such as MySQL, remain common today. However, these traditional solutions have limits that quickly become apparent as companies (and data) grow. Modern companies have changing priorities: downtime (planned or unplanned) is never acceptable; customers require a fast and unified experience; and data of all types is growing at unimaginable rates. Solutions such as Riak are designed to handle these shifting priorities.
Top Reasons to Move to Riak
- Zero Downtime: Distributed NoSQL solutions like Riak are designed for always-on availability. This means data is always read/write accessible and the system never goes down. Downtime, planned or unplanned, can make or break a customer experience.
- Ease-of-Scale: Traffic can be unpredictable. Businesses need to scale up quickly to handle peak loads during holidays or major releases, but then need to scale back down to save money. Riak makes it easy to add and remove any number of nodes as needed and automatically redistributes data across the cluster. Scaling up or down never needs to be a burden again.
- Flexible Data Model: From user generated data to machine-to-machine (M2M) activity, unstructured data is now commonplace. Riak can store any type of data easily with its simple key/value architecture.
- Global Data Locality: Every company is a global company and needs to provide consistent, low-latency experiences to everyone, regardless of physical location. Riak’s multi-datacenter replication makes it easy to set up datacenters wherever users are, for both geo-data locality and maintaining active backups.
Users That Switched to Riak
Many top companies have already moved from relational architectures to Riak. Here’s a look at a few that have made the switch.
Bump (acquired by Google)
Bump, acquired by Google in 2013, allows users to share contact information and photos by bumping two phones together. Bump uses Riak to store almost all of its user data: contacts, communications sent and received, handset information, social network OAuth tokens, etc. Bump moved from MySQL to Riak due to its operational qualities: “No longer will we have to do any master/slave song and dance, nor will we fret about performance, capacity, or scalability; if we need more, we’ll just add nodes to the cluster.” Learn more about their move in their case study.
Alert Logic helps companies defend against security threats and address compliance mandates, such as PCI and HIPAA. Alert Logic switched from MySQl to Riak to collect and process machine data and to perform real-time analytics, detect anomalies, ensure compliance, and proactively respond to threats. Alert Logic processes nearly 5TB/day in Riak and has achieved performance results of up to 35k operations/second. Learn more about how Alert Logic improved performance through Riak in our blog post.
The Weather Company
The Weather Company provides millions of people every day with the world’s best weather forecasts, content and data, connecting with them through television, online, mobile and tablet screens. Riak is central to The Weather Company’s weather data services platform that delivers real-time weather services to aerospace, insurance, energy, retail, media, government, and hospitality industries. Check out our blog to see why The Weather Company selected Riak over MySQL to support their massive big data needs.
Dell uses Riak as the core distributed database technology underlying its customer cloud management solutions. Riak is used to collect and manage data associated with customer application provisioning and scaling, application configuration management, usage governance, and cloud utilization monitoring. In 2012, Enstratius (acquired by Dell) switched to Riak from MySQL in order to provide cross-datacenter redundancy, high write availability, and fault tolerance. Check out the full Enstratius case study.
Data Modeling in Riak
Riak has a “schemaless” design. Objects are comprised of key/value pairs, which are stored in flat namespaces called buckets. Below is a chart with some simple approaches to building common application types with a key/value model.
|Session||User/Session ID||Session Data|
|Advertising||Campaign ID||Ad Content|
|Sensor||Date, Date/Time||Sensor Updates|
|User Data||Login, eMail, UUID||User Attributes|
|Content||Title, Integer||Text, JSON/XML/HTML Document, Images, etc.|
January 3, 2013
Most teams considering using Riak come from a relational database background. From our webcast on moving from relational to Riak, the below slide deck covers an overview of Riak, how the architecture differs from a relational approach, the advantages for scaling and development, and what’s different about application building and database operating in a non-relational world. We also include a few stories of Riak users who replaced MySQL or added Riak to the mix.
Interested in learning more? Check out our overview, From Relational to Riak.
November 15, 2010
Oracle didn’t (and can’t) take away your open source software.
A few weeks ago Oracle caused a lot of confusion when they changed the makeup of the MySQL product line, including a “MySQL Classic Edition” version that does not cost money and does not include InnoDB. That combination in the product chart made many people wonder if InnoDB itself had ceased to be free in either the “free beer” or “free speech” sense. The people wondering and worrying included a few users of Innostore, the InnoDB-based storage engine that can be used with Riak.
Luckily, open source software doesn’t work that way.
Oracle didn’t really even try to do what some people thought; they just released a confusing product graph which they have since updated. The MySQL that most people think of first is MySQL Community Edition and it was not one of the editions mentioned in the chart that confused people. That version of MySQL, as well as all of the GPL components included in it such as InnoDB, remain free of cost and also available under the GPL.
This confusion eventually led to a public response from Oracle, so you can read it authoritatively if you like.
Even if someone wanted to, they couldn’t “take it back” in the way that some people feared. Existing software that has been legitimately available under an open source license such as GPL or Apache cannot retroactively be made unfree. The copyright owner might choose to not license future improvements as open source, but that which is already released in such a way cannot be undone. Oracle and Innobase aren’t currently putting new effort into Embedded InnoDB, but a new project has spun up to move it forward. If the HailDB project produces improvements of value, then future versions of Innostore may switch to using that engine instead of using the original Embedded InnoDB release.
July 21, 2010
A few members of the Basho Team are at OSCON all week. We are here to take part in the amazing talks and tutorials, but also to talk to Riak users and community members.
Yesterday I had the opportunity to have a brief chat with Andrew Harvey, a developer who hails from Sydney, Australia and works for a startup called Lexer. They are building some awesome applications around brand monitoring and analytics, and Riak is helping in that effort.
In this short clip, Andrew gives me the scoop on Lexer and shares a few details around why and how they are using Riak (and MySQL) at Lexer.
(Deepest apologies for the shakiness. I forgot the Tripod.)
March 19, 2010
One of the challenges of switching from a relational database (Oracle, MySQL, etc.) to a “NoSQL” database like Riak is understanding how to represent your data within the database. This post is the beginning of a series of entries on how to structure your data within Riak in useful ways.
Choices have consequences
There are many reasons why you might choose Riak for your database, and I’m going to explain how a few of those reasons will affect the way your data is structured and manipulated.
One oft-cited reason for choosing Riak, and other alternative databases, is the need to manage huge amounts of data, collectively called “Big Data”. If you’re storing lots of data, you’re less likely to be doing online queries across large swaths of the data. You might be doing real-time aggregation in addition to calculating longer-term information in the background or offline. You might have one system collecting the data and another processing it. You might be storing loosely-structured information like log data or ad impressions. All of these use-cases call for low ceremony, high availability for writes, and little need for robust ways of finding data — perfect for a key/value-style scheme.
Another reason one might pick Riak is for flexibility in modeling your data. Riak will store any data you tell it to in a content-agnostic way — it does not enforce tables, columns, or referential integrity. This means you can store binary files right alongside more programmer-transparent formats like JSON or XML. Using Riak as a sort of “document database” (semi-structured, mostly de-normalized data) and “attachment storage” will have different needs than the key/value-style scheme — namely, the need for efficient online-queries, conflict resolution, increased internal semantics, and robust expressions of relationships.
The third reason for choosing Riak I want to discuss is related to CAP – in that Riak prefers A (Availability) over C (Consistency). In contrast to a traditional relational database system, in which transactional semantics ensure that a datum will always be in a consistent state, Riak chooses to accept writes even if the state of the object has been changed by another client (in the case of a race-condition), or if the cluster was partitioned and the state of the object diverges. These architecture choices bring to the fore something we should have been considering all along — how should our applications deal with inconsistency? Riak lets you choose whether to let the “last one win” or to resolve the conflict in your application by automated or human-assisted means.
More mindful domain modeling
What’s the moral of these three stories? When modeling your data in Riak, you need to understand better the shape of your data. You can no longer rely on normalization, foreign key constraints, secondary indexes and transactions to make decisions for you.
Questions you might ask yourself when designing your schema:
- Will my access pattern be read-heavy, write-heavy, or balanced?
- Which datasets churn the most? Which ones require more sophisticated conflict resolution?
- How will I find this particular type of data? Which method is most efficient?
- How independent/interrelated is this type of data with this other type of data? Do they belong together?
- What is an appropriate key-scheme for this data? Should I choose my own or let Riak choose?
- How much will I need to do online queries on this data? How quickly do I need them to return results?
- What internal structure, if any, best suits this data?
- Does the structure of this data promote future design modifications?
- How resilient will the structure of the data be if requirements change? How can the change be effected without serious interruption of service?
I like to draw up my domain concepts on a pad of unlined paper or a whiteboard with boxes and arrows, then figure out how they map onto the database. Ultimately, the concepts define your application, so get those solid before you even worry about Riak.
Once you’ve thought carefully about the questions described above, it’s time think about how your data will map to Riak. We’ll start from the small-scale in this post (single domain concepts) and work our way out in future installments.
For a single class of objects in your domain, let’s consider the structure of that data. Here’s where you’re going to decide two interrelated issues — how this class of data will be queried and how opaque its internal structure will be to Riak.
The first issue, how the data will be queried, depends partly on how easy it is to intuit the key of a desired object. For example, if your data is user profiles that are mostly private, perhaps the user’s email or login name would be appropriate for the key, which would be easy to establish when the user logs in. However, if the key is not so easy to determine, or is arbitrary, you will need map-reduce or link-walking to find it.
The second issue, how opaque the data is to Riak, is affected by how you query but also by the nature of the data you’re storing. If you need to do intricate map-reduce queries to find or manipulate the data, you’ll likely want it in a form like JSON (or an Erlang term) so your map and reduce functions can reason about the data. On the other hand, if your data is something like an image or PDF, you don’t want to shoehorn that into JSON. If you’re in the situation where you need both a form that’s opaque to Riak, and to be able to reason about it with map-reduce, have your application add relevant metadata to the object. These are created using X-Riak-Meta-* headers in HTTP or riak_object:update_metadata/2 in Erlang.
Rule of thumb: if it’s an abstract datatype, use a map-reduce-friendly format like JSON; if it’s a concrete form, use its original representation. Of course, there are exceptions to every rule, so think carefully about your modeling problem.
Consistency, replication, conflict resolution
The second issue I would consider for each type of data is the access pattern and desired level of consistency. This is related to the questions above of read/write loads, churn, and conflicts.
Riak provides a few knobs you can turn at schema-design time and at request-time that relate to these issues. The first is allow_mult, or whether to allow recording of divergent versions of objects. In a write-heavy load or where clients are updating the same objects frequently, possibly at the same time, you probably want this on (true), which you can change by setting the bucket properties. The tradeoffs are that the vector clock may grow quickly and your application will need to decide how to resolve conflicts.
The second knob you can turn is the n_val, or how many replicas of each object to store, also a per-bucket setting. The default value is 3, which will work for many applications. If you need more assurance that your data is going to withstand failures, you might increase the value. If your data is non-critical or in large chunks, you might decrease the value to get greater performance. Knowing what to choose for this value will depend on an honest assessment of both the value of your data and operational concerns.
The third knob you can turn is per-request quorums. For reads, this is the R request parameter: how many replicas need to agree on the value for the read to succeed (the default is 2). For writes, there are two parameters, W and DW. W is how many replicas need to acknowledge the write request before it succeeds (default is 2). DW (durable writes) is how many replica backends need to confirm that the write finished before the entire write succeeds (default is 0). If you need greater consistency when reading or writing your data, you’ll want to increase these numbers. If you need greater performance and can sacrifice some consistency, decrease them. In any case, your R, W, and DW values must be smaller than n_val if you want the request to succeed.
What do these have to do with your data model? Fundamentally understanding the structure and purpose of your data will help you determine how you should turn these knobs. Some examples:
- Log data: You’ll probably want low R and W values so that writes are accepted quickly. Because these are fire-and-forget writes, you won’t need allow_mult turned on. You might also want a low n_val, depending on how critical your data is.
- Binary files: Your n_val is probably the most significant issue here, mostly depending on how large your files are and how many replicas of them you can tolerate (storage consumption).
- JSON documents (abstract types): The defaults will work in most cases. Depending on how frequently the data is updated, and how many you update within a single conceptual operation with the application, you may want to enable allow_mult to prevent blind overwrites.