January 22, 2015
In speaking with Riak users, both open source and commercial, we are frequently told that Riak’s key/value model is more flexible and faster to develop against than a traditional relational database. Even though Riak is well suited for many applications, there are inevitable tradeoffs in the query options and data types that are available. A key/value model has no concept of columns or rows, so Riak does not have join operations. Riak can be queried directly via HTTP, via the protocol buffers API, or through various client libraries. However, no SQL or SQL-like language is currently available.
Riak’s key/value data model does not preclude queryability. There are several powerful querying options including:
- Riak Search: Integration with Apache Solr provides full-text search and support for Solr’s client query APIs.
- Secondary Indexes: Secondary Indexes (2i) give developers the ability to tag an object stored in Riak with one or more query values. Indexes can be either integers or strings, and can be queried by either exact matches or ranges of values.
- MapReduce: Developers can leverage Riak MapReduce for tasks like filtering documents by tag, counting words in documents, and extracting links to related data.
For more information, check out the Riak documentation on Querying Data.
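To make the Secondary Indexes option more concrete, here is a minimal in-memory sketch of the idea: objects are tagged with index values at write time and then looked up by exact match or by range. The `IndexedBucket` class and its methods are hypothetical names invented for this illustration and are not the Riak client API; the `_bin` and `_int` suffixes follow the naming convention Riak uses for string and integer indexes.

```python
class IndexedBucket:
    """Toy stand-in for a bucket with secondary indexes (illustration only)."""

    def __init__(self):
        self.objects = {}  # key -> opaque value (bytes)
        self.tags = {}     # key -> list of (index name, index value)

    def put(self, key, value, indexes=None):
        """Store a value and tag it with zero or more index entries."""
        self.objects[key] = value
        self.tags[key] = list(indexes or [])

    def query_exact(self, index_name, wanted):
        """Return the keys whose index entry equals the wanted value."""
        return sorted(
            key for key, tags in self.tags.items()
            if any(name == index_name and value == wanted for name, value in tags)
        )

    def query_range(self, index_name, low, high):
        """Return the keys whose index entry falls within [low, high]."""
        return sorted(
            key for key, tags in self.tags.items()
            if any(name == index_name and low <= value <= high for name, value in tags)
        )


users = IndexedBucket()
users.put("alice", b'{"plan": "paid"}', [("plan_bin", "paid"), ("visits_int", 42)])
users.put("bob", b'{"plan": "free"}', [("plan_bin", "free"), ("visits_int", 7)])
users.put("carol", b'{"plan": "paid"}', [("plan_bin", "paid"), ("visits_int", 15)])

paid_users = users.query_exact("plan_bin", "paid")          # exact match
frequent_users = users.query_range("visits_int", 10, 50)    # range query
```

The stored values stay opaque; only the tags supplied at write time are queryable, which is the tradeoff this option makes.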
The table below illustrates key/value mappings for common application types. Remember that values in Riak are opaque and stored on disk as binaries – JSON or XML documents, images, text, etc. The way data is organized in Riak should take into account the unique needs of the application, including access patterns such as read/write distribution, latency differences between various operations, use of Riak features (including MapReduce, Search, Secondary Indexes), and more.
|Application Type||Key||Value|
|Session||User/Session ID||Session Data|
|Advertising||Campaign ID||Ad Content|
|Sensor||Date, Date/Time||Sensor Updates|
|User Data||Login, eMail, UUID||User Attributes|
|Content||Title, Integer||Text, JSON/XML/HTML Document, Images, etc.|
Consider, for example, one of the canonical use cases for Riak: storing user and session data. In a relational database, the “users” table is well known. It provides a unique identifier per user, followed by a series of attributes about that user stored as individual columns, such as:
- First name
- Last name
- Counter of Site Visits
- Paid Account Identifier
This data can then be used to correlate or count paid users, common interests, etc. via a series of SQL queries against the row/column structure of the users table.
Riak, in contrast, provides flexibility in how this data can be modeled based upon the application use case. It may be desirable to create a Users bucket, with the UserName (or Unique Identifier) as the key and a JSON object storing all user attributes as the value. Or, as we describe in Data Modeling with Riak Data Types, leverage the power of Riak Data Types by creating a map type for each user storing:
- first and last name strings in the register type,
- interests as a set,
- a counter for visits,
- and a flag for paid account identifier.
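The two modeling options described above can be sketched side by side. This is only an illustration of the shape of the data: the `UserMap` dataclass mimics the register/set/counter/flag layout with plain Python types and is not a Riak Data Type, and the bucket is a plain dict. All names are hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass
class UserMap:
    first_name: str                               # register
    last_name: str                                # register
    interests: set = field(default_factory=set)   # set
    visits: int = 0                               # counter
    paid_account: bool = False                    # flag


def to_json_value(user):
    """Serialize the user as a JSON document, the first option described above."""
    return json.dumps(
        {
            "first_name": user.first_name,
            "last_name": user.last_name,
            "interests": sorted(user.interests),
            "visits": user.visits,
            "paid_account": user.paid_account,
        },
        sort_keys=True,
    ).encode("utf-8")


user = UserMap("Ada", "Lovelace")
user.interests.add("mathematics")   # set: add an element
user.visits += 1                    # counter: increment
user.paid_account = True            # flag: enable

# Users bucket keyed by user name, with the JSON document as the opaque value.
users_bucket = {"ada.lovelace": to_json_value(user)}
```

With the JSON approach the application owns the whole document; with a map type each field is updated through its own operation (add to set, increment counter, enable flag).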
One of the best ways to enable application interaction with objects (a key/value pair) in Riak is to provide structured bucket and key names for the objects. This approach often involves wrapping information about the object in the object’s location data itself.
For example, appending a timestamp, UUID, or geographical coordinate to a key’s name allows for fine-grained queryability via simple lookup to locate and retrieve a specific set of information. Leveraging the same naming mechanism as created for users (UniqueID as the key) enables, in a separate sessions bucket, storing the UUID appended with a timestamp as the key and the session data (in binary format) as the object. In this way, the same UUID can be used to obtain both user and session data stored in different buckets and in different formats.
For additional information, and more complex considerations such as modeling relationships and advanced social applications, see the Riak documentation on use cases and data modeling.
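A short sketch of this naming scheme, under stated assumptions: buckets are modeled as plain dicts, and the `session_key` helper, the underscore separator, and the timestamp format are hypothetical choices made for this example rather than anything Riak prescribes.

```python
import uuid
from datetime import datetime, timezone


def session_key(user_id, started_at):
    """Build a session key by appending a UTC timestamp to the user's ID."""
    return f"{user_id}_{started_at.strftime('%Y%m%dT%H%M%SZ')}"


# Two buckets, each modeled as a dict of key -> opaque value.
buckets = {"users": {}, "sessions": {}}

user_id = str(uuid.uuid4())
started_at = datetime(2015, 1, 22, 9, 30, 0, tzinfo=timezone.utc)

buckets["users"][user_id] = b'{"first_name": "Ada"}'           # JSON document
buckets["sessions"][session_key(user_id, started_at)] = b"\x00\x01session-blob"  # binary

# Knowing only the user ID and the session start time, both objects are
# reachable by direct key lookup, with no query language involved.
user_value = buckets["users"][user_id]
session_value = buckets["sessions"][session_key(user_id, started_at)]
```

Because the key can be computed from information the application already has, retrieval stays a simple lookup.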
Resolving Data Conflicts
In any system that replicates data, conflicts can arise – e.g., if two clients update the same object at the exact same time or if not all updates have yet reached hardware that is experiencing lag. Riak is “eventually consistent” – while data is always available, not all replicas may have the most recent update at the same time, causing brief periods (generally on the order of milliseconds) of inconsistency while all state changes are synchronized.
However, Riak does provide features to detect and help resolve the statistically small number of incidents when data conflicts occur. When a read request is performed, Riak looks up all replicas for that object. By default, Riak will return the most recently updated version, determined by looking at the object’s vector clock. Vector clocks are metadata attached to each replica when it is created. They are extended each time a replica is updated to keep track of versions. Clients can also be allowed to resolve conflicts themselves.
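The way vector clocks distinguish "newer" from "conflicting" versions can be shown with a generic textbook sketch. This is not Riak's internal implementation; the function names and the dict-of-counters representation are assumptions made for illustration.

```python
def increment(clock, actor):
    """Return a copy of the clock with the actor's counter advanced by one."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen every update recorded in clock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a, b):
    """Classify the relationship between two versions of an object."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"  # neither saw the other's update: a conflict


base = increment({}, "client_x")            # first write
newer = increment(base, "client_x")         # same client updates again
sibling = increment(base, "client_y")       # another client updates the base

relation_to_base = compare(newer, base)     # "a is newer"
relation_to_sibling = compare(newer, sibling)  # "concurrent"
```

When one clock descends the other, the older version can be discarded safely; the "concurrent" case is the one that needs resolution, either automatically or by the client.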
Further, when an outdated object is discovered as part of a read request, Riak will automatically update the out-of-sync replica to make it consistent. Read Repair, a self-healing property of the database, will even update a replica that returns a “not_found” in the event that a node loses it due to physical failure.
Riak also features “Active Anti-Entropy,” which is an automatic self-healing property that runs in the background. Rather than waiting for a read request to trigger a replica repair (as with Read Repair), Active Anti-Entropy constantly uses a hash tree exchange to compare replicas of objects and automatically repairs or updates any that are divergent, missing, or corrupt. This can be beneficial for large clusters storing “stale” data.
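The idea behind comparing replicas by hash rather than object by object can be sketched as follows. This is a deliberately flattened, two-level simplification (segment hashes, then keys) rather than a real hash tree, and it is not Riak's implementation; `SEGMENTS`, `segment_of`, and the other names are hypothetical.

```python
import hashlib

SEGMENTS = 4  # assumed number of key segments, chosen arbitrarily


def segment_of(key):
    """Assign a key to a segment using a stable hash."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % SEGMENTS


def segment_hashes(replica):
    """Compute one digest per segment over the sorted keys and values in it."""
    digests = []
    for segment in range(SEGMENTS):
        digest = hashlib.sha256()
        for key in sorted(k for k in replica if segment_of(k) == segment):
            digest.update(key.encode("utf-8") + b"\x00" + replica[key] + b"\x00")
        digests.append(digest.hexdigest())
    return digests


def divergent_keys(replica_a, replica_b):
    """Exchange segment digests first; inspect keys only where digests differ."""
    hashes_a, hashes_b = segment_hashes(replica_a), segment_hashes(replica_b)
    suspects = set()
    for segment in range(SEGMENTS):
        if hashes_a[segment] == hashes_b[segment]:
            continue  # whole segment matches, nothing to compare
        for key in set(replica_a) | set(replica_b):
            if segment_of(key) == segment and replica_a.get(key) != replica_b.get(key):
                suspects.add(key)
    return sorted(suspects)


replica_a = {"k1": b"v1", "k2": b"v2", "k3": b"v3"}
replica_b = {"k1": b"v1", "k2": b"stale"}          # k2 divergent, k3 missing
to_repair = divergent_keys(replica_a, replica_b)
```

Matching segments are skipped after a single digest comparison, which is what makes a background comparison of large, rarely read data sets affordable.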
More information on vector clocks, dotted version vectors, and conflict resolution can be found in the online documentation in the section regarding Causal Context.
Multi-site replication is quickly becoming critical for many of today’s platforms and applications. Not only does replication across multiple clusters provide geographic data locality – the ability to serve global traffic at low-latencies – it can also be an integral part of a disaster recovery or backup strategy. Other teams may use multi-site replication to maintain secondary data stores, both for failover as well as for performing intensive computation without disrupting production load. Multi-site replication is included in Basho’s commercial extension to Riak, Riak Enterprise, which also includes 24/7 support.
Multi-site replication in Riak works differently than the typical approach seen in the relational world, multi-master replication. In Riak’s multi-datacenter replication, one cluster acts as a “primary cluster.” The primary cluster handles replication requests from one or more “secondary clusters” (generally located in datacenters in other regions or countries). If the datacenter with the primary cluster goes down, a secondary cluster can take over as the primary cluster. In this sense, Riak’s multi-datacenter capabilities are “masterless.”
In multi-datacenter replication, there are two primary modes of operation: full sync and real-time. In full sync mode, a complete synchronization occurs between primary and secondary cluster(s). In real-time mode, continual, incremental synchronization occurs – replication is triggered by new updates. Full sync is performed upon initial connection of a secondary cluster, and then periodically (by default, every 6 hours). Full sync is also triggered if the TCP connection between primary and secondary clusters is severed and then recovered.
Data transfer is unidirectional (primary->secondary). However, bidirectional synchronization can be achieved by configuring a pair of connections between clusters.
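The interplay of the two modes can be illustrated with a toy simulation. It assumes each cluster is a plain dict and models only one-way transfer, real-time pushes, and the full sync that runs on connection and reconnection; the periodic full sync and deletions are left out. `ReplicationLink` and its methods are hypothetical names, not Riak Enterprise configuration or API.

```python
class ReplicationLink:
    """One-way link from a primary cluster to a secondary cluster."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.connected = False
        self.connect()

    def connect(self):
        """Establish (or re-establish) the link; this triggers a full sync."""
        self.connected = True
        return self.full_sync()

    def disconnect(self):
        """Simulate a severed connection between the clusters."""
        self.connected = False

    def full_sync(self):
        """Copy every object that is missing or different on the secondary."""
        repaired = 0
        for key, value in self.primary.items():
            if self.secondary.get(key) != value:
                self.secondary[key] = value
                repaired += 1
        return repaired

    def put(self, key, value):
        """Write to the primary; real-time mode pushes the update if connected."""
        self.primary[key] = value
        if self.connected:
            self.secondary[key] = value


primary, secondary = {"existing": b"0"}, {}
link = ReplicationLink(primary, secondary)   # initial full sync copies "existing"
link.put("a", b"1")                          # replicated in real time
link.disconnect()
link.put("b", b"2")                          # missed while the link is down
missed_while_down = "b" not in secondary
repaired_on_reconnect = link.connect()       # full sync catches up
```

Bidirectional synchronization would correspond to a second link pointing the other way, which this sketch does not attempt to model.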
Full documentation for multi-datacenter replication in Riak Enterprise is available in the online documentation.
Modeling data in any non-relational solution requires a different way of thinking about the data itself. Rather than an assumption that all data cleanly fits into a structure of rows and columns, the data domain can be overlayed on the core Key/Value store (Riak) in a variety of ways. There are, however, distinct tradeoffs and benefits to understand.
Relational Databases have:
- Foreign keys and constraints
- Sophisticated query planners
- Declarative query language (SQL)
Riak has:
- A Key/Value model where the value is any unstructured data
- More data redundancy that provides better availability
- Eventual consistency
- Simplified query capabilities
- Riak Search
What you will gain:
- More flexible, fluid designs
- More natural data representations
- Scaling without pain
- Reduced operational complexity
For more information on Data Modeling, or to chat with a member of the Basho team on the topic, please request a Tech Talk.
January 6, 2015
If you have read about Riak, or seen a member of the Basho team present, you have probably heard the phrase “Your data is opaque to Riak.” While this is not, strictly, true with the inclusion of distributed Data Types in Riak 2.0, it was a phrase that hinted at the core structure of Riak itself.
Riak is a Key Value data store.
In a relational database, data is organized by tables that are separate and unique structures. Within these tables exist rows of data organized into columns. As such, interaction with the database is by retrieving or updating entire tables, individual rows, or a group of columns within a set of rows.
In contrast, Riak has a much simpler data model. An Object is both the largest and smallest element of data. As such, interaction with the database is by retrieving or modifying the entire object. There is no partial fetch or update of the data.
Keys in Riak are simply binary values (or strings) that are used to identify Objects. The Key/Value pair (or Object) is stored in a higher-level namespace called a Bucket. And, with Riak 2.0, there is an extra layer of abstraction known as Bucket Types.
This Key/Value/Bucket model enables broad flexibility in modeling the application’s data domain with Riak as the data store for persistence.
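As a rough illustration of this model, the sketch below addresses each object by bucket type, bucket, and key, and shows that changing one field means fetching and rewriting the whole object. The dict-backed `store` and the helper names are hypothetical and stand in for a real client; the bucket and key names are made up.

```python
import json

# (bucket type, bucket, key) -> opaque value stored as bytes
store = {}


def put(bucket_type, bucket, key, value):
    """Store the entire object under its full address."""
    store[(bucket_type, bucket, key)] = value


def get(bucket_type, bucket, key):
    """Fetch the entire object; there is no partial read."""
    return store[(bucket_type, bucket, key)]


def update_json(bucket_type, bucket, key, changes):
    """Read the whole object, modify it in the application, write it all back."""
    document = json.loads(get(bucket_type, bucket, key))
    document.update(changes)
    put(bucket_type, bucket, key, json.dumps(document).encode("utf-8"))
    return document


put("default", "users", "ada", json.dumps({"name": "Ada", "visits": 1}).encode("utf-8"))
updated = update_json("default", "users", "ada", {"visits": 2})
```

The store never inspects the bytes it holds, so the same three-part address works equally well for JSON, images, or any other encoding.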
Another NoSQL model that many are familiar with is the document store. Unlike the Key/Value model, the data store is aware of the structure of the objects stored. These objects, or documents, are grouped into “collections” (analogous to relational “tables”), and the datastore provides a query mechanism to search collections for objects with particular attributes. When the data that is being persisted is easily rendered as a JSON document, a document store can seem a natural fit. Some common use cases include product catalog data and content management systems.
The Basho Docs have a lengthy tutorial entitled Using Riak as a Document Store that walks you through the process of leveraging Riak as a document store for a CMS. There are many approaches to modeling, but the tutorial demonstrates the power of Riak 2.0 features by combining the maps data type and indexing that data with Riak Search.
When the data you are persisting can be represented as JSON, and you require the ability to query the data, Riak 2.0 is an excellent solution for persisting and modeling document data. The flexibility of the Key/Value model, combined with the power of Riak Search and Riak Data Types, provides you with a highly scalable, highly available document store with rich, full-text query capabilities. In addition, the inclusion of the maps data type means that you don’t have to write complex client-side resolution logic when faced with network partitions. Riak Data Types handle that conflict resolution automatically.
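To give a feel for why automatic resolution is possible, here is a simplified sketch of convergent merging for a grow-only counter and a grow-only set. It is an assumption-laden teaching example: Riak's actual Data Types are more sophisticated (for instance, they also support decrements and removals), and none of these function names come from Riak.

```python
def merge_counter(a, b):
    """Merge two per-node counter states by taking the maximum for each node."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}


def counter_value(state):
    """The visible counter value is the sum of every node's contribution."""
    return sum(state.values())


def merge_set(a, b):
    """Merge two grow-only sets by union."""
    return a | b


# Two replicas diverged during a partition.
visits_a = {"node1": 3, "node2": 1}
visits_b = {"node1": 2, "node2": 4}
merged_visits = merge_counter(visits_a, visits_b)
total_visits = counter_value(merged_visits)

tags_a = {"riak", "nosql"}
tags_b = {"riak", "search"}
merged_tags = merge_set(tags_a, tags_b)
```

Because each merge gives the same result regardless of the order in which replicas are combined, no application code has to choose a winner.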
A scalable, available document store that is operationally simple may seem compelling enough to use Riak. But when you combine the characteristics of Riak with the multi-datacenter replication capabilities of Riak Enterprise, now you have a solution that enables you to bring your data operations closer to the end user.
Scalable, available, operationally simple, and replicated. That’s the power of using Riak as a document store.
April 23, 2014
Traditional database architectures were the default option for many pre-Internet use cases, and relational databases such as MySQL remain common today. However, these traditional solutions have limits that quickly become apparent as companies (and data) grow. Modern companies have changing priorities: downtime (planned or unplanned) is never acceptable; customers require a fast and unified experience; and data of all types is growing at unimaginable rates. Solutions such as Riak are designed to handle these shifting priorities.
Top Reasons to Move to Riak
- Zero Downtime: Distributed NoSQL solutions like Riak are designed for always-on availability. This means data is always read/write accessible and the system never goes down. Downtime, planned or unplanned, can make or break a customer experience.
- Ease-of-Scale: Traffic can be unpredictable. Businesses need to scale up quickly to handle peak loads during holidays or major releases, but then need to scale back down to save money. Riak makes it easy to add and remove any number of nodes as needed and automatically redistributes data across the cluster. Scaling up or down never needs to be a burden again.
- Flexible Data Model: From user generated data to machine-to-machine (M2M) activity, unstructured data is now commonplace. Riak can store any type of data easily with its simple key/value architecture.
- Global Data Locality: Every company is a global company and needs to provide consistent, low-latency experiences to everyone, regardless of physical location. Riak’s multi-datacenter replication makes it easy to set up datacenters wherever users are, for both geo-data locality and maintaining active backups.
Users That Switched to Riak
Many top companies have already moved from relational architectures to Riak. Here’s a look at a few that have made the switch.
Bump (acquired by Google)
Bump, acquired by Google in 2013, allows users to share contact information and photos by bumping two phones together. Bump uses Riak to store almost all of its user data: contacts, communications sent and received, handset information, social network OAuth tokens, etc. Bump moved from MySQL to Riak due to its operational qualities: “No longer will we have to do any master/slave song and dance, nor will we fret about performance, capacity, or scalability; if we need more, we’ll just add nodes to the cluster.” Learn more about their move in their case study.
Alert Logic helps companies defend against security threats and address compliance mandates, such as PCI and HIPAA. Alert Logic switched from MySQL to Riak to collect and process machine data and to perform real-time analytics, detect anomalies, ensure compliance, and proactively respond to threats. Alert Logic processes nearly 5TB/day in Riak and has achieved performance results of up to 35k operations/second. Learn more about how Alert Logic improved performance through Riak in our blog post.
The Weather Company
The Weather Company provides millions of people every day with the world’s best weather forecasts, content and data, connecting with them through television, online, mobile and tablet screens. Riak is central to The Weather Company’s weather data services platform that delivers real-time weather services to aerospace, insurance, energy, retail, media, government, and hospitality industries. Check out our blog to see why The Weather Company selected Riak over MySQL to support their massive big data needs.
Dell uses Riak as the core distributed database technology underlying its customer cloud management solutions. Riak is used to collect and manage data associated with customer application provisioning and scaling, application configuration management, usage governance, and cloud utilization monitoring. In 2012, Enstratius (acquired by Dell) switched to Riak from MySQL in order to provide cross-datacenter redundancy, high write availability, and fault tolerance. Check out the full Enstratius case study.
Data Modeling in Riak
Riak has a “schemaless” design. Objects are comprised of key/value pairs, which are stored in flat namespaces called buckets. Below is a chart with some simple approaches to building common application types with a key/value model.
|Application Type||Key||Value|
|Session||User/Session ID||Session Data|
|Advertising||Campaign ID||Ad Content|
|Sensor||Date, Date/Time||Sensor Updates|
|User Data||Login, eMail, UUID||User Attributes|
|Content||Title, Integer||Text, JSON/XML/HTML Document, Images, etc.|
April 16, 2014
The world of gaming can be unpredictable. It can be hard to judge if a game is going to be the next Angry Birds and experience exponential, global growth. Riak is designed to help gaming platforms handle this uncertainty with ease. Its focus on high availability means that all data remains accessible, even during node failure. Its flexible data model and redundant, fault-tolerant design allow gaming platforms to store any type of data needed. Riak is also built for operational simplicity at scale, so Riak will seamlessly grow with data and popularity. Finally, the option for multi-datacenter replication means that gamers all over the world will get the same low-latency experience across multiple devices.
Top Use Cases for Riak in Gaming
- Player Data: Riak provides low-latency, highly available data storage for key player data, including user and profile information, game performance, statistics and rankings, and more. Riak also provides many different tools for querying and indexing this data, such as Riak Search, Secondary Indexing, and MapReduce.
- Session Storage: Riak is frequently used to store and serve session data with predictable low-latency – necessary for game play. Riak imposes no restrictions on the type of content stored (since all objects are stored on disk as binaries), so session data can be encoded in many ways and can evolve without administrative changes to schemas.
- Social Information: Riak provides flexible, robust storage for social data such as social graph information, player profiles and relationships, and social authentication tokens.
- Global Data Locality: When gaming, players require a low-latency experience, regardless of where they’re physically located. Otherwise, interrupted or slow game play can lead to poor user experience and possible user abandonment. Riak Enterprise’s multi-datacenter capabilities allow game data to be physically close to players and serve them data no matter where they happen to be.
Riak in Production
Riak is already in production at many top gaming platforms. Here’s a look at a few that have switched to Riak.
Rovio is the creator of the popular mobile game, Angry Birds. Since user growth can be hard to predict, they needed an infrastructure that could support unexpected viral growth without failing or causing downtime. They selected Riak due to its ease-of-scale and fault tolerance. Riak now powers their new cartoon series, Angry Birds Toons, and new mobile games. Learn more about why they moved to Riak in this case study and video from GDC.
Hibernum is a creator and developer of unique gaming experiences that combine the latest in social gaming, top quality visuals and animations, and cutting edge design. They switched from a relational database to Riak due to the high availability, ability to scale to peak loads, and predictable operational cost. Riak is used to store user game information for one of their most popular social games. Check out the complete case study, Hibernum Selects Riak for User Data Storage.
Kiip is a platform for building rewards and achievements into your games. Kiip replaced MongoDB with Riak in order to achieve low read/write latencies and horizontal scalability. Kiip uses Riak for storing and serving session and device data. Learn more from the video on scaling Riak to 25MM Ops/Day.
Riot Games is the creator of League of Legends and faced some challenges with supporting millions of concurrent players at any given moment. They switched to Riak from MySQL for their next generation stats system, which tracks gameplay statistics and stores terabytes of data that gets aggregated and presented to players in near real-time. More information on how they use Riak and why they selected it can be found here.
Data Modeling in Riak
Riak has a “schemaless” design. Objects are comprised of key/value pairs, which are stored in flat namespaces called buckets. Here are some common approaches to structuring gaming data with Riak’s key/value design:
|Application Type||Key||Value|
|Player Data||Login, Email, UUID||Player Attributes (often stored as a JSON document); Player Rewards and Stats|
|Social Data||Login, Email, UUID||Player Profiles, Social Graph Information, Facebook/Twitter Tokens|
|Session Information||User/Session ID||Session Data|
|Image or Video Content||Content Name, ID or Integer||.JPG, .PNG, .GIF or other image format; .MOV, .MPG, .MP4 or other video file format|
To learn more about how gaming platforms can use Riak for their data needs, check out the complete overview, “Gaming on Riak: A Technical Introduction.” To get started with Riak, Contact Us or download it now.
April 14, 2014
Modern day advertisers are faced with many new challenges to ensure they can provide highly available, low latency experiences to thousands of clients and partners, and millions of users. They are also tasked with serving large amounts of data all over the world and can experience significant traffic spikes. That is why advertisers are switching to Riak for their database solution. Riak’s redundant, fault-tolerant design ensures that advertising companies can serve data reliably and quickly. Riak is also built for operational simplicity at scale and helps advertisers quickly grow to meet peak loads.
Top Use Cases for Riak in Advertising
- Serving Ad Content: Riak’s rapid storage and content agnosticism make it ideal for storing ad content and handling influxes of ad traffic.
- Session Storage: This type of data is naturally a good fit for Riak’s key/value model. This data can also be encoded in many different ways and can evolve without any administrative changes to the schema.
- Mobile Experiences: Riak is ideal for the low-latency, always-available small object storage needed to power mobile experiences across platforms.
- Global Data Locality: Riak Enterprise’s multi-datacenter capabilities allow advertisers to maintain a global data footprint while providing an always-on, low-latency experience, anywhere in the world.
Riak in Production
Riak is already in production at many top advertising and marketing organizations. Here’s a look at a few that have switched to Riak.
Tapjoy is a mobile advertising and monetization platform that is available on over one billion devices across the world. They selected Riak due to its high availability, low-latency, and multi-datacenter replication. They store 48TB of data in Riak and operate hundreds of thousands of reads/writes per second. Learn more about why Tapjoy selected Riak from the case study.
OpenX is an ad technology platform that serves trillions of ads. They use Riak for user and trafficking data behind their data services API. OpenX also uses Riak’s multi-datacenter replication across several data centers. Watch Anthony Molinaro (Infrastructure Architect at OpenX) talk about how they use Riak for their serve-time data needs.
Velti is a mobile marketing and advertising technology provider. They use Riak for their interactive mobile platform, including letting people interact with their TV by voting, giving feedback, participating in contests, etc. Velti runs 18 nodes across two data centers, which provides them with scale, durability, and availability. Their case study goes into more detail about the process of moving to Riak.
JBA is a digital consultancy that specializes in developing customer understanding and behavioral targeting. They use Riak as a core part of their behavioral analysis and remarketing tool. They store over 10 million objects in Riak and can easily scale up to account for holiday sales cycles or new product releases as needed. Learn more about why they selected Riak from the beginning from their case study.
Moz provides analytics software to track all of a website’s inbound marketing efforts on one platform. They support over 27,000 customers and 300,000 community members worldwide. Moz uses Riak to store customer campaign search engine rankings data. Learn more about how Riak outperformed Cassandra at Moz in the case study.
Data Modeling in Riak
Riak has a “schemaless” design. Objects are comprised of key/value pairs, which are stored in flat namespaces called buckets. Here are some common approaches to structuring advertising data with Riak’s key/value design:
|Application Type||Key||Value|
|Advertisement||Campaign ID||Ad Content|
|User Data||Login, Email, UUID||User Attributes (often stored as a JSON document)|
|Image or Video Content||Content Name, ID or Integer||.JPG, .PNG, .GIF or other image format; .MOV, .MPG, .MP4 or other video file format|
|Session Information||User or Session ID||Session Data|
To learn more about how advertisers can use Riak for their data needs, check out the complete overview, “Advertisers on Riak: A Technical Introduction.” To get started with Riak, Contact Us or download it now.
November 18, 2013
This series of blog posts will discuss how Riak differs from traditional relational databases. For more information about any of the points discussed, download our technical overview, “From Relational to Riak.” The previous post in the series discussed High Availability and Cost of Scale.
In order to provide high availability, which is a cornerstone of Riak’s value proposition, the database stores several copies of each key/value pair.
This availability requirement leads to a fundamental tradeoff: in order to continue to serve requests in the presence of failure, we do not force all data in the cluster to stay in sync. Riak will allow writes and reads no matter how many servers (and their stored replicas) are offline or otherwise unreachable.
(Incidentally, this lack of strong coordination has another consequence beyond high availability: Riak is a very, very fast database.)
Riak does provide both active and passive self-healing mechanisms to minimize the window of time during which two servers may have different versions of data.
The concept of eventual consistency may seem unfamiliar, but if you’ve ever implemented a cache or used DNS, those are common examples of the idea. In a large enough system, it’s effectively the default state of all data.
However, with the forthcoming release of Riak 2.0, operators will be able to designate selected pieces of data to require coordination and maintain strong consistency over high availability. Writing such data will be slower and subject to failure if too many servers are unreachable, but the overall robust architecture of Riak will still provide a fast, highly available solution.
Riak stores data using a simple key/value model, which offers developers tremendous flexibility to define access models that suit their applications. It is also content-agnostic, so developers can store arbitrary data in any convenient format.
Instead of forcing application-specific data structures to be mapped into (and out of) a relational database, they can simply be serialized and dropped directly into Riak. For records that will be frequently updated, if some of the fields are immutable and some aren’t, we recommend keeping the immutable data in one key/value pair and the rest organized into a single or multiple objects based on update patterns.
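The recommendation to separate immutable fields from frequently updated ones might look like the following sketch. The `:profile` and `:activity` key suffixes, the field names, and the `split_record` helper are all hypothetical choices made for this example.

```python
import json


def split_record(user_id, record, immutable_fields):
    """Split one record into two key/value pairs based on update patterns."""
    immutable = {k: v for k, v in record.items() if k in immutable_fields}
    mutable = {k: v for k, v in record.items() if k not in immutable_fields}
    return {
        f"{user_id}:profile": json.dumps(immutable, sort_keys=True).encode("utf-8"),
        f"{user_id}:activity": json.dumps(mutable, sort_keys=True).encode("utf-8"),
    }


record = {
    "signup_date": "2013-11-18",   # never changes after creation
    "email": "ada@example.com",    # treated as immutable in this example
    "last_login": "2013-11-20",    # changes often
    "visits": 12,                  # changes often
}

pairs = split_record("user42", record, immutable_fields={"signup_date", "email"})
```

Frequent updates then rewrite only the small `:activity` object, while the `:profile` object is written once and simply read afterwards.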
Relational databases are ingrained habits for many of us, but moving beyond them can be liberating. Further information about data modeling, including sample configurations, is available in the Use Cases section of the documentation.
One tradeoff with this simpler data model is that there is no SQL or SQL-like language with which to query the data.
To achieve optimal performance, it is advisable to take advantage of the flexibility of the key/value model to define simple retrieval patterns. In other words, determine the most useful queries and write the results of those queries as the data is being processed.
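One way to read this advice is "answer the query at write time." The sketch below stores each sensor reading under its own key and, in the same step, maintains a per-day summary object that can later be fetched with a single lookup. The dict-backed `store`, the key formats, and the function name are assumptions for illustration; a real deployment would also need to consider concurrent writers updating the same summary.

```python
import json

store = {}  # key -> opaque value stored as bytes


def record_reading(sensor_id, day, hour, temperature):
    """Write the raw reading and update the precomputed daily summary."""
    store[f"{sensor_id}_{day}T{hour:02d}"] = json.dumps(
        {"temperature": temperature}
    ).encode("utf-8")

    summary_key = f"{sensor_id}_{day}_summary"
    summary = json.loads(store.get(summary_key, b'{"count": 0, "max": null}'))
    summary["count"] += 1
    if summary["max"] is None or temperature > summary["max"]:
        summary["max"] = temperature
    store[summary_key] = json.dumps(summary).encode("utf-8")


record_reading("sensor7", "2013-11-18", 9, 18.0)
record_reading("sensor7", "2013-11-18", 10, 21.5)
record_reading("sensor7", "2013-11-18", 11, 20.0)

# The "query" for the day's statistics is now a plain key lookup.
daily_summary = json.loads(store["sensor7_2013-11-18_summary"])
```

The cost moves from read time to write time, which suits workloads where the same question is asked far more often than the data changes.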
Because it is not always possible to know in advance what questions will need to be asked of your data, Riak offers added functionality on top of the key/value model. Tools such as Riak Search (a distributed, full-text search engine), Secondary Indexing (ability to tag objects with queryable metadata), and MapReduce (leveraged for aggregation tasks) are available to perform ad hoc queries as needed.
For many users, the tradeoffs of moving to Riak are worthwhile due to the overall benefits; however, it can be a bit of an adjustment. To see why others have chosen to switch to Riak from both relational systems and other NoSQL databases, check out our Users Page.
March 5, 2013
Mobile platforms need to provide always available, low-latency experiences that can scale to millions of users and support highly concurrent access. Riak’s redundant and fault-tolerant design ensures mobile data can be served quickly and reliably, and Riak is run in production by many popular mobile applications. For a full overview, check out the whitepaper “Mobile on Riak: A Technical Introduction.” Below are a few key mobile use cases and basic approaches to modeling them in Riak:
User Data: Storing user accounts, profile information, and events is a common use case for Riak. Mobile apps often store this data in JSON documents, using a UUID or other identifier as the key. Data can be queried through Riak features such as secondary indexes, MapReduce, and full-text search.
Session Data: Since session IDs are commonly stored in cookies, or otherwise known at lookup time, they are a natural fit for Riak’s key/value model and Riak can serve these requests at predictably low-latency. Session data can also be encoded in many different ways and evolve without any administrative changes to schema.
Text & Multimedia Storage: Since Riak is content agnostic, mobile platforms can easily store a variety of different types of data, including audio, text, photos, video, etc. to power mobile experiences.
Social Authentication: Many mobile applications have users sign in via their Facebook or Twitter accounts. Riak’s key/value scheme makes it easy to store both registered accounts and the tokens that make it possible for users to authenticate with their social accounts.
Global Data Locality: Riak Enterprise’s multi-datacenter capabilities mean mobile data can be stored in physical proximity to users and served at low-latency no matter where they happen to be.
Here is a chart with possible ways these applications and services can be modeled using Riak’s key/value design. Of course, your application should be structured in a way appropriate to its access and query patterns, among other factors – this is just to get you started. For more information on designing applications with Riak, check out our documentation.
To learn more about how mobile platforms can use Riak for their data needs, check out the complete overview, “Mobile on Riak: A Technical Introduction.” For more details about Riak and the latest 1.3 release, sign up for our webcast on March 7th.
January 31, 2013
This is the second in a series of blog posts covering Riak for retail and eCommerce platforms. To learn more, join our “Retail on Riak” webcast on Friday, February 8th or download the “Riak for Retail” whitepaper.
In our last post, we looked at three Riak users in the eCommerce/retail space. In this post, we will look at some common use cases for Riak and how to start building them with Riak’s key/value model and querying features.
- Shopping Carts: Riak’s focus on availability makes it attractive to retailers offering shopping carts and other “buy now” functionality. If the shopping cart is unavailable, loses product additions, or responds slowly to users, it has a direct impact on revenue and user trust.
- Product Catalogs: Retailers need to store anywhere from thousands to tens of thousands of inventory items and associated information – such as photos, descriptions, prices, and category information. Riak’s flexible, fast storage makes it a good fit for this type of data.
- User Information: As mobile, web, and multi-channel shopping become more social and personalized, retailers have to manage increasing amounts of user information. Riak scales efficiently to meet increased data and traffic needs and ensures user data is always available for online shopping experiences.
- Session Data: Riak provides a highly reliable platform for session storage. User/session IDs are usually stored in cookies or otherwise known at lookup time, a good fit for Riak’s key/value model.
In Riak, objects are comprised of key/value pairs, which are stored in flat namespaces called “buckets.” Riak is content-type agnostic, and stores all objects on disk as binaries, giving retailers lots of flexibility to store anything they want. Here are some common approaches to modeling the data and services discussed above in Riak:
Riak provides several features for querying data:
Riak Search: Riak Search is a distributed, full-text search engine. It provides support for various MIME types & analyzers, and robust querying.
Possible Use Cases: Searching product information or product descriptions.
Secondary Indexing: Secondary Indexing (2i) gives developers the ability, at write time, to tag an object stored in Riak with one or more values, queryable by exact matches or ranges of an index.
Possible Use Cases: Tagging products with categories, special promotion identifiers, date ranges, price or other metadata.
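As a rough illustration of the tagging idea above, here is a minimal Python sketch that builds the request headers and query paths for secondary indexes without contacting a cluster. The `x-riak-index-<name>_bin` / `_int` header naming and the `/buckets/<bucket>/index/...` path layout reflect Riak's HTTP interface as I understand it and should be checked against the docs for your version. The helper names, the bucket name, and the product fields are hypothetical.

```python
from urllib.parse import quote


def index_headers(tags):
    """Turn {"category": "shoes", "price": 4999} into 2i request headers.

    Integer values get the _int suffix, everything else is treated as a
    string (binary) index with the _bin suffix.
    """
    headers = {}
    for name, value in tags.items():
        suffix = "int" if isinstance(value, int) else "bin"
        headers[f"x-riak-index-{name}_{suffix}"] = str(value)
    return headers


def index_query_path(bucket, index, start, end=None):
    """Build the path for an exact-match query, or a range query if end is given."""
    parts = ["buckets", bucket, "index", index, str(start)]
    if end is not None:
        parts.append(str(end))
    return "/" + "/".join(quote(part, safe="") for part in parts)


# Tag a product at write time, then look it up by category or price range.
write_headers = index_headers({"category": "shoes", "price": 4999})
by_category = index_query_path("products", "category_bin", "shoes")
by_price = index_query_path("products", "price_int", 1000, 5000)
```

The headers would accompany the PUT that stores the product, and the paths would be requested with GET against a Riak node to retrieve the matching keys.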
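MapReduce: Developers can use Riak MapReduce to run filtering, counting, and link-extraction jobs over stored objects.
Possible Use Cases: Filtering product information by tag, counting items, and extracting links to related products.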
Check out our docs for more information on building applications and services with Riak.
March 19, 2010
One of the challenges of switching from a relational database (Oracle, MySQL, etc.) to a “NoSQL” database like Riak is understanding how to represent your data within the database. This post is the beginning of a series of entries on how to structure your data within Riak in useful ways.
Choices have consequences
There are many reasons why you might choose Riak for your database, and I’m going to explain how a few of those reasons will affect the way your data is structured and manipulated.
One oft-cited reason for choosing Riak, and other alternative databases, is the need to manage huge amounts of data, collectively called “Big Data”. If you’re storing lots of data, you’re less likely to be doing online queries across large swaths of the data. You might be doing real-time aggregation in addition to calculating longer-term information in the background or offline. You might have one system collecting the data and another processing it. You might be storing loosely-structured information like log data or ad impressions. All of these use-cases call for low ceremony, high availability for writes, and little need for robust ways of finding data — perfect for a key/value-style scheme.
Another reason one might pick Riak is for flexibility in modeling your data. Riak will store any data you tell it to in a content-agnostic way — it does not enforce tables, columns, or referential integrity. This means you can store binary files right alongside more programmer-transparent formats like JSON or XML. Using Riak as a sort of “document database” (semi-structured, mostly de-normalized data) and “attachment storage” will have different needs than the key/value-style scheme — namely, the need for efficient online-queries, conflict resolution, increased internal semantics, and robust expressions of relationships.
The third reason for choosing Riak I want to discuss is related to CAP – in that Riak prefers A (Availability) over C (Consistency). In contrast to a traditional relational database system, in which transactional semantics ensure that a datum will always be in a consistent state, Riak chooses to accept writes even if the state of the object has been changed by another client (in the case of a race-condition), or if the cluster was partitioned and the state of the object diverges. These architecture choices bring to the fore something we should have been considering all along — how should our applications deal with inconsistency? Riak lets you choose whether to let the “last one win” or to resolve the conflict in your application by automated or human-assisted means.
More mindful domain modeling
What’s the moral of these three stories? When modeling your data in Riak, you need to understand better the shape of your data. You can no longer rely on normalization, foreign key constraints, secondary indexes and transactions to make decisions for you.
Questions you might ask yourself when designing your schema:
- Will my access pattern be read-heavy, write-heavy, or balanced?
- Which datasets churn the most? Which ones require more sophisticated conflict resolution?
- How will I find this particular type of data? Which method is most efficient?
- How independent/interrelated is this type of data with this other type of data? Do they belong together?
- What is an appropriate key-scheme for this data? Should I choose my own or let Riak choose?
- How much will I need to do online queries on this data? How quickly do I need them to return results?
- What internal structure, if any, best suits this data?
- Does the structure of this data promote future design modifications?
- How resilient will the structure of the data be if requirements change? How can the change be effected without serious interruption of service?
I like to draw up my domain concepts on a pad of unlined paper or a whiteboard with boxes and arrows, then figure out how they map onto the database. Ultimately, the concepts define your application, so get those solid before you even worry about Riak.
Once you’ve thought carefully about the questions described above, it’s time to think about how your data will map to Riak. We’ll start from the small-scale in this post (single domain concepts) and work our way out in future installments.
For a single class of objects in your domain, let’s consider the structure of that data. Here’s where you’re going to decide two interrelated issues — how this class of data will be queried and how opaque its internal structure will be to Riak.
The first issue, how the data will be queried, depends partly on how easy it is to intuit the key of a desired object. For example, if your data is user profiles that are mostly private, perhaps the user’s email or login name would be appropriate for the key, which would be easy to establish when the user logs in. However, if the key is not so easy to determine, or is arbitrary, you will need map-reduce or link-walking to find it.
The second issue, how opaque the data is to Riak, is affected by how you query but also by the nature of the data you’re storing. If you need to do intricate map-reduce queries to find or manipulate the data, you’ll likely want it in a form like JSON (or an Erlang term) so your map and reduce functions can reason about the data. On the other hand, if your data is something like an image or PDF, you don’t want to shoehorn that into JSON. If you’re in the situation where you need both a form that’s opaque to Riak, and to be able to reason about it with map-reduce, have your application add relevant metadata to the object. These are created using X-Riak-Meta-* headers in HTTP or riak_object:update_metadata/2 in Erlang.
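To make the metadata approach concrete, here is a small Python sketch that prepares the URL and headers for storing an opaque PDF along with a few X-Riak-Meta-* headers. It only builds the request pieces and sends nothing. The function name, the host, the bucket, the key, and the metadata fields are all hypothetical, and the `/riak/<bucket>/<key>` URL layout is an assumption that may differ between Riak versions.

```python
def build_store_request(base_url, bucket, key, content_type, metadata):
    """Return (url, headers) for a PUT that stores an object with user metadata."""
    headers = {"Content-Type": content_type}
    for name, value in metadata.items():
        # Each metadata entry becomes its own X-Riak-Meta-* header.
        headers[f"X-Riak-Meta-{name}"] = str(value)
    url = f"{base_url}/riak/{bucket}/{key}"
    return url, headers


url, headers = build_store_request(
    "http://localhost:8098",      # assumed local node address
    "manuals",                    # hypothetical bucket
    "widget-42.pdf",              # hypothetical key
    "application/pdf",
    {"Title": "Widget manual", "Pages": 12},
)
```

The PDF bytes would go in the request body unchanged, so the stored value keeps its original representation and the descriptive details travel in the headers.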
Rule of thumb: if it’s an abstract datatype, use a map-reduce-friendly format like JSON; if it’s a concrete form, use its original representation. Of course, there are exceptions to every rule, so think carefully about your modeling problem.
Consistency, replication, conflict resolution
The second issue I would consider for each type of data is the access pattern and desired level of consistency. This is related to the questions above of read/write loads, churn, and conflicts.
Riak provides a few knobs you can turn at schema-design time and at request-time that relate to these issues. The first is allow_mult, or whether to allow recording of divergent versions of objects. In a write-heavy load or where clients are updating the same objects frequently, possibly at the same time, you probably want this on (true), which you can change by setting the bucket properties. The tradeoffs are that the vector clock may grow quickly and your application will need to decide how to resolve conflicts.
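With allow_mult on, a read can return several divergent versions (siblings), and the application has to merge them. The sketch below shows one possible merge policy for a hypothetical shopping cart stored as a mapping of SKU to quantity: take the union of items and keep the largest quantity seen. This is an illustrative policy of my own choosing, not something Riak prescribes, and the function name and data shape are assumptions.

```python
def resolve_cart_siblings(siblings):
    """Merge divergent cart versions into a single cart.

    Each sibling is a dict of {sku: quantity}. The merged cart contains
    every SKU seen in any sibling, with the highest quantity recorded.
    """
    merged = {}
    for cart in siblings:
        for sku, quantity in cart.items():
            merged[sku] = max(quantity, merged.get(sku, 0))
    return merged


# Two clients updated the same cart concurrently.
siblings = [
    {"sku-1": 1, "sku-2": 2},
    {"sku-2": 3, "sku-3": 1},
]
resolved = resolve_cart_siblings(siblings)
```

A union-style merge never loses an addition, but an item one client removed can reappear if another sibling still contains it, so the right policy depends on the data.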
The second knob you can turn is the n_val, or how many replicas of each object to store, also a per-bucket setting. The default value is 3, which will work for many applications. If you need more assurance that your data is going to withstand failures, you might increase the value. If your data is non-critical or in large chunks, you might decrease the value to get greater performance. Knowing what to choose for this value will depend on an honest assessment of both the value of your data and operational concerns.
The third knob you can turn is per-request quorums. For reads, this is the R request parameter: how many replicas need to agree on the value for the read to succeed (the default is 2). For writes, there are two parameters, W and DW. W is how many replicas need to acknowledge the write request before it succeeds (default is 2). DW (durable writes) is how many replica backends need to confirm that the write finished before the entire write succeeds (default is 0). If you need greater consistency when reading or writing your data, you’ll want to increase these numbers. If you need greater performance and can sacrifice some consistency, decrease them. In any case, your R, W, and DW values must not exceed n_val if you want the request to succeed.
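A quick way to sanity-check quorum settings before sending requests is sketched below in Python. It assumes that a quorum value is acceptable as long as it does not exceed n_val (so R equal to n_val is allowed), which is my understanding of how Riak treats these parameters. The second helper applies the common rule of thumb that a read overlaps the most recent successful write when R + W is greater than n_val. Both function names are hypothetical.

```python
def check_quorums(n_val, r=2, w=2, dw=0):
    """Return a list of problems; an empty list means the settings can be satisfied."""
    problems = []
    for name, value in (("r", r), ("w", w), ("dw", dw)):
        if value > n_val:
            problems.append(f"{name}={value} exceeds n_val={n_val}")
    return problems


def reads_overlap_writes(n_val, r, w):
    """True when every read quorum shares at least one replica with every write quorum."""
    return r + w > n_val


defaults_ok = check_quorums(3)                 # defaults from the text with n_val of 3
too_strict = check_quorums(3, r=4)             # asks for more replicas than exist
fast_but_loose = reads_overlap_writes(3, 1, 1)
```

Lowering R and W to 1 passes the first check but fails the overlap test, which is the consistency-for-speed trade described above.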
What do these have to do with your data model? Fundamentally understanding the structure and purpose of your data will help you determine how you should turn these knobs. Some examples:
- Log data: You’ll probably want low R and W values so that writes are accepted quickly. Because these are fire-and-forget writes, you won’t need allow_mult turned on. You might also want a low n_val, depending on how critical your data is.
- Binary files: Your n_val is probably the most significant issue here, mostly depending on how large your files are and how many replicas of them you can tolerate (storage consumption).
- JSON documents (abstract types): The defaults will work in most cases. Depending on how frequently the data is updated, and how many documents you update within a single conceptual operation in the application, you may want to enable allow_mult to prevent blind overwrites.
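The examples above could be captured as per-bucket settings along these lines. This Python sketch only assembles the JSON body that would be sent when setting bucket properties. The bucket names and the specific numbers are illustrative assumptions rather than recommendations, and the `{"props": {...}}` envelope reflects the HTTP bucket-properties format as I understand it.

```python
import json

# Hypothetical profiles, one per kind of data discussed above.
BUCKET_PROFILES = {
    # Fire-and-forget writes: low quorums, no siblings, fewer replicas.
    "logs": {"n_val": 2, "allow_mult": False, "r": 1, "w": 1},
    # Large binaries: replica count drives storage consumption.
    "files": {"n_val": 2, "allow_mult": False},
    # Frequently updated JSON: keep siblings to avoid blind overwrites.
    "documents": {"n_val": 3, "allow_mult": True},
}


def bucket_props_body(profile):
    """Serialize one profile as the JSON body for a bucket-properties update."""
    return json.dumps({"props": BUCKET_PROFILES[profile]})


logs_body = bucket_props_body("logs")
```

Keeping the profiles in one place makes it easier to revisit these choices when the value of the data or the operational picture changes.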