Tag Archives: MapReduce

Riak Development Anti-Patterns

January 7, 2014

Writing an application that can take full advantage of Riak’s robust scaling properties requires a different way of looking at data storage and retrieval. Developers who bring a relational mindset to Riak may create applications that work well with a small data set but start to show strain in production, particularly as the cluster grows.

This post looks at some of the common conceptual challenges developers face when adapting to Riak.

Dynamic Querying

Riak offers query features such as secondary indexes (2i), MapReduce, and full-text search, but throwing a large quantity of data into Riak and expecting those tools to find whatever you need is setting yourself (and Riak) up to fail. Performance will be poor, especially as you scale.

Reads and writes in Riak should be as fast with ten billion values in storage as with ten thousand.

Key/value operations seem primitive (and they are), but you’ll find they are flexible, scalable, and very fast (and predictably so).

Treat 2i and friends as tools to be applied judiciously, design the main functionality of your application as if they don’t exist, and your software will continue to work at blazing speeds when you have petabytes of data stored across dozens of servers.

Normalization

Normalizing data is generally a useful approach in a relational database, but unlikely to lead to happy results with Riak.

Riak lacks foreign key constraints and join operations, two vital parts of the normalization story, so reconstructing a single record from multiple objects would involve multiple read requests; certainly possible and fast enough on a small scale, but not ideal for larger requests.

Instead, imagine the performance of your application if most of your requests were a single, trivial read. Preparing and storing the answers to queries you’re going to ask later is a best practice for Riak.
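
For instance, a “dashboard” that would otherwise be assembled from several normalized records can be written once as a single object. A minimal sketch using the official Erlang client (riakc); the node address, bucket, key, and JSON shape are illustrative assumptions:

```erlang
%% Sketch: store a precomputed "answer" object so the common request
%% is a single key/value read. Bucket and key names are illustrative.
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087).

%% At write time, assemble and store the whole dashboard.
Dashboard = <<"{\"user\":\"ada\",\"recent_orders\":[1001,1002],\"cart_count\":3}">>.
Obj = riakc_obj:new(<<"dashboards">>, <<"ada">>, Dashboard, "application/json").
ok = riakc_pb_socket:put(Pid, Obj).

%% At read time, the whole page is one trivial get.
{ok, Fetched} = riakc_pb_socket:get(Pid, <<"dashboards">>, <<"ada">>).
Page = riakc_obj:get_value(Fetched).
```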

Ducking Conflict Resolution

One of the first hurdles Basho faced when releasing Riak was educating developers on the complexities of eventual consistency and the need to intelligently resolve data conflicts.

Because Riak is optimized for high availability, accepting writes even when servers are offline or disconnected from the cluster due to network failures, it is not uncommon for two servers to hold different versions of a piece of data.

The simplest approach to coping with this is to allow Riak to choose a winner based on timestamps. It can do this more effectively if developers follow Basho’s guidance on sending updates with vector clock metadata to help track causal history. Often, however, concurrent updates cannot be automatically resolved via vector clocks, and trusting server clocks to determine which write was the last to arrive is a terrible conflict resolution method.

Even if your server clocks are magically always in sync, are your business needs well-served by blindly applying the most recent update? Some databases have no alternative but to handle it that way, but we think you deserve better.

Riak 2.0, when installed on new clusters, will default to retaining conflicts and requiring the application to resolve them, but we’re also providing replicated data types to automate conflict resolution on the servers.
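
To make application-side resolution concrete, here is a minimal sketch using the Erlang client (riakc): fetch the object, merge any siblings with domain logic (the longest-value rule below is only a placeholder), and write the result back under the fetched object’s causal context.

```erlang
-module(resolver).
-export([resolve/3]).

%% Sketch of application-side conflict resolution with the Erlang client.
%% The merge rule is a placeholder; real applications should merge
%% siblings according to their domain semantics.
resolve(Pid, Bucket, Key) ->
    {ok, Obj} = riakc_pb_socket:get(Pid, Bucket, Key),
    case riakc_obj:get_values(Obj) of
        [_OnlyOne] ->
            ok;                                  %% no conflict to resolve
        Siblings ->
            %% update_value/2 keeps the vector clock from the fetch, so
            %% this put supersedes all of the siblings at once.
            Resolved = riakc_obj:update_value(Obj, merge(Siblings)),
            riakc_pb_socket:put(Pid, Resolved)
    end.

%% Placeholder merge: keep the longest sibling value.
merge(Siblings) ->
    hd(lists:sort(fun(A, B) -> byte_size(A) >= byte_size(B) end, Siblings)).
```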

If you want to minimize the need for conflict resolution, modeling with as much immutable data as possible is a big win.

Mutability

For years, functional programmers have been singing the praises of immutable data, and it confers significant advantages when using a distributed data store like Riak.

Most obviously, conflict resolution is dramatically simplified when objects are never updated.

Even in the world of single-server databases, updating records in place carries costs. Most databases lose all sense of history when data is updated, and it’s entirely possible for two different clients to overwrite the same field in rapid succession, leading to unexpected results.

Some data is always going to be mutable, but thinking about the alternative can lead to better design.
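
One common way to apply this is to write every revision of a document under a fresh, versioned key and confine mutation to a tiny pointer object. A sketch, assuming the Erlang client (riakc) and hypothetical bucket names:

```erlang
-module(revisions).
-export([save_revision/3]).

%% Sketch: write each revision under a fresh key; only a small pointer
%% object is ever mutated. Bucket names are illustrative.
save_revision(Pid, DocId, Body) ->
    Version = integer_to_binary(erlang:system_time(millisecond)),
    RevKey = <<DocId/binary, "/", Version/binary>>,
    RevObj = riakc_obj:new(<<"doc_revisions">>, RevKey, Body, "text/plain"),
    ok = riakc_pb_socket:put(Pid, RevObj),      %% write-once, never updated
    %% In production, fetch the existing pointer and update it so the
    %% write carries its causal context instead of creating siblings.
    Ptr = riakc_obj:new(<<"doc_latest">>, DocId, RevKey, "text/plain"),
    ok = riakc_pb_socket:put(Pid, Ptr),
    RevKey.
```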

SELECT * FROM <table>

A perfectly natural response when first encountering a populated database is to see what’s in it. In a relational database, you can easily retrieve a list of tables and start browsing their records.

As it turns out, this is a terrible idea in Riak.

Riak is optimized for unstructured, opaque data; however, it is not designed to allow for trivial retrieval of lists of buckets (very loosely analogous to tables) and keys.

Doing so can put a great deal of stress on a large cluster and can significantly impact performance.

It’s a rather unusual idea for someone coming from a relational mindset, but being able to algorithmically determine the key that you need for the data you want to retrieve is a major part of the Riak application story.
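
Concretely, “algorithmically determined” usually just means the key can be derived from values the application already knows, such as a user ID and a date. An illustrative sketch (the key scheme is hypothetical):

```erlang
-module(keys).
-export([daily_report_key/2]).

%% Sketch: derive a key from data the application already has on hand,
%% so no list or scan operation is needed to find the object.
daily_report_key(UserId, {Year, Month, Day}) ->
    Date = io_lib:format("~4..0B-~2..0B-~2..0B", [Year, Month, Day]),
    iolist_to_binary([UserId, "/reports/", Date]).

%% keys:daily_report_key(<<"ada">>, {2014, 1, 7})
%%   => <<"ada/reports/2014-01-07">>
```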

Large Objects

Because Riak sends multiple copies of your data around the network for every request, values that are too large can clog the pipes, so to speak, causing significant latency problems.

Basho generally recommends a soft cap of 1-4 MB per object; larger sizes are possible with careful tuning.

For significantly larger objects, Riak CS offers an Amazon S3-compatible (and also OpenStack Swift-compatible) key/value object store that uses Riak under the hood.

Running a Single Server

This is more of an operations anti-pattern, but it stems from a common misunderstanding of Riak’s architecture.

It is quite common to install Riak in a development environment using its devrel build target, which creates five full Riak stacks (including Erlang virtual machines) to run on one server to simulate a cluster.

However, running Riak on a single server for benchmarking or production use is counterproductive, regardless of whether you have one stack or five on the box.

It is possible to argue that Riak is more of a database coordination platform than a database itself. It uses Bitcask or LevelDB to persist data to disk, but more importantly, it commonly uses at least 64 such embedded databases in a cluster.

Needless to say, if you run 64 databases simultaneously on a single filesystem you are risking significant I/O and CPU contention unless the environment is carefully tuned (and has some pretty fast disks).

Perhaps more importantly, Riak’s core design goal, its raison d’être, is high availability via data redundancy and related mechanisms. Writing three copies of all your data to a single server is mostly pointless, both contributing to resource contention and throwing away Riak’s ability to survive server failure.

So, Now What?

As always, we recommend visiting Basho’s docs website for more details on how to build and run Riak, and many of our customers have given presentations on their use cases of Riak, including data modeling.

Also, keep an eye on the Basho blog where we provide high-level overviews like this of Riak and the larger non-relational database world.

For a detailed analysis of your needs and modeling options, contact Basho to engage our professional services team.

John Daily

"How is Riak different from Hadoop?"

October 28, 2013

The technology community is extremely agile and fast-paced. It can turn on a dime to solve business problems as they arise. However, with this agility comes budding terminology that can often provide false categorizations. This can lead to confusion, especially when companies evaluate new technologies based on a surface understanding of these terms. The world of data is full of these terms, including the notorious “NoSQL” and “big data.”

As described in a previous post, NoSQL is a misleading term. This term represents a response to changing business priorities that require more flexible, resilient architectures (as opposed to the traditional, rigid systems that often happen to use SQL). However, within the NoSQL space, there are dozens of players that can be as different from one another as they are from any of the various SQL-speaking systems.

Big data is another term that, while fairly self-explanatory, has been overused to the point of dilution. One reason NoSQL databases have become necessary is their ability to easily scale to keep up with data growth. Simply storing a lot of data isn’t the solution, though. Some data is more critical than others (and should be accessible no matter what) and some data needs to be analyzed to provide business insights. When digging into a business’s actual needs, “big data” is too vague a term to describe both of these use cases.

Loose use of such terms (to highlight just a few) can lead to industry confusion. One area of confusion that we have experienced relates to Basho’s own distributed database, Riak, and the distributed processing system, Hadoop.

While these two systems are actually complementary, we are often asked “How is Riak different from Hadoop?”

To help explain this, it’s important to start with a basic understanding of both systems. Riak is a distributed database that is built for high availability, fault tolerance, and scalability. It is best used to store large amounts of critical data that applications and users need constant access to. Riak is built by Basho Technologies and can be used as an alternative to, or in conjunction with, relational databases (such as MySQL) and other “NoSQL” databases (such as MongoDB or Cassandra).

Hadoop is a framework that allows for the distributed parallel processing of large data sets across clusters of computers. It was originally based on the “MapReduce” system, which was invented by Google. Hadoop consists of two core parts: the underlying Hadoop Distributed File System (HDFS), which ensures stored data is always available to be analyzed, and MapReduce, which allows for scalable computation by dividing and running queries over multiple machines. Hadoop provides an inexpensive, scalable solution for bulk data processing and is mostly used as part of an overarching analytics strategy, not for primary “hot” data storage.

One easy way to distinguish between the two is to look at some of the common use cases.

Riak Use Cases

Riak can be used by any application that needs to always have access to large amounts of critical data. Riak uses a key/value data model and is data-type agnostic, so operators can store any type of content in Riak. Due to the key/value model, certain industry use cases fit easily into Riak. These include:

  • Gaming – storing player data, session data, etc
  • Retail – underpinning shopping carts, product inventories, etc
  • Mobile – social authentication, text and multimedia storage, global data locality, etc
  • Advertising – serving ad content, session storage, mobile experiences, etc
  • Healthcare – prescription or patient records, patient IDs, health data that must always be available across a network of providers, etc

For a full list of use cases, check out our Users Page.

Hadoop Use Cases

Hadoop is designed for situations where you need to store unmodeled data and run computationally intensive analytics over that data. The original use cases of both MapReduce and Hadoop were to produce indexes for distributed search engines at Google and Yahoo, respectively. Any industry that needs large-scale analytics to improve its business can use Hadoop. Some common examples include finance (building models for accurate portfolio evaluations and risk analysis) and eCommerce (analyzing shopping behavior to deliver product recommendations or better search results).

Riak and Hadoop are based on many of the same tenets, making their usage complementary for some companies. Many companies that utilize Riak today have created scripts, or processes, to pull data from Riak and push it into other solutions (like Hadoop) for historical archiving or future analysis. Recognizing this trend, Basho is exploring the creation of additional tools to simplify this process.

If you are interested in our thinking on these data export capabilities, please contact us.

In Summary

Every tool has its value. Hadoop excels at being used by a relatively small subset of the business to answer big questions. Riak excels at being used by a very large number of users and powering critical data for businesses.

Basho

Top Five Questions About Riak

April 17, 2013

This post looks at five commonly asked questions about Riak. For more questions and answers, check out our Riak FAQ.

What hardware should I use with Riak?

Riak is designed to be run on commodity hardware and is run in production on a variety of different server types on both private and public infrastructure. However, there are several key considerations when choosing the right infrastructure for your Riak deployment.

RAM is one of the most important factors – RAM availability directly affects what Riak backend you should use (see question below), and is also required for complex MapReduce queries. In terms of disk space, Riak automatically replicates data according to a configurable n_val. A bucket-level property that defaults to 3, n_val determines how many copies of each object will be stored, and provides the inherent redundancy underlying Riak’s fault-tolerance and high availability. Your hardware choice should take into consideration how many objects you plan to store and the replication factor; however, Riak is designed for horizontal scale and lets you easily add capacity by joining additional nodes to your cluster. Additional factors that might affect choice of hardware include IO capacity, especially for heavy write loads, and intra-cluster bandwidth. For additional factors in capacity planning, check out our documentation on cluster capacity planning.
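
Because n_val is a bucket-level property, it can be inspected or set per bucket. A quick sketch with the Erlang client (riakc); note that changing n_val on a bucket that already contains data is generally discouraged:

```erlang
%% Sketch: inspect and set the replication factor for one bucket.
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087).
{ok, Props} = riakc_pb_socket:get_bucket(Pid, <<"orders">>).
proplists:get_value(n_val, Props).                     %% 3 by default
ok = riakc_pb_socket:set_bucket(Pid, <<"orders">>, [{n_val, 5}]).
```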

Riak is explicitly supported on several cloud infrastructure providers. Basho provides free Riak AMIs for use on AWS. We recommend using large, extra large, and cluster compute instance types on Amazon EC2 for optimal performance. Learn more in our documentation on performance tuning for AWS. Engine Yard provides hosted Riak solutions, and we also offer virtual machine images for the Microsoft VM Depot.

What backend is best for my application?

Riak offers several different storage backends to support use cases with different operational profiles. Bitcask and LevelDB are the most commonly used backends.

Bitcask was developed in-house at Basho to offer extremely fast read/write performance and high throughput. Bitcask is the default storage engine for Riak and ships with it. Bitcask uses an in-memory hash-table of all keys you write to Riak, which points directly to the on-disk location of the value. The direct lookup from memory means Bitcask never uses more than one disk seek to read data. Writes are also very fast with Bitcask’s write-once, append-only design. Bitcask also offers benefits like easier backups and fast crash recovery. The inherent limitation is that your system must have enough memory to contain your entire keyspace, with room for a few other operational components. However, unless you have an extremely large number of keys, Bitcask fits many datasets. Visit our documentation for more details on Bitcask, and use the Bitcask Capacity Calculator to assist you with sizing your cluster.

LevelDB is an open-source, on-disk key-value store from Google. Basho maintains a version of LevelDB tuned specifically for Riak. LevelDB doesn’t have Bitcask’s memory constraints around keyspace size, and thus is ideal for deployments with a very large number of keys. In addition to this advantage, LevelDB uses Google Snappy data compression, which provides particular efficiency for text data like raw text, Base64, JSON, HTML, etc. To use LevelDB with Riak, you must change the storage backend setting in the app.config file. You can find more details on LevelDB here.
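
For reference, the switch is a single setting in the riak_kv section of app.config; a sketch of the relevant fragment:

```erlang
%% app.config fragment: switch the node's storage backend to LevelDB.
{riak_kv, [
    %% The default is riak_kv_bitcask_backend.
    {storage_backend, riak_kv_eleveldb_backend}
]}
```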

Riak also offers a Memory storage backend that does not persist data and is used simply for testing or small amounts of transient state. You can also run multiple backends within a single Riak instance, which is useful if you want to use different backends for different Riak buckets or use a different storage configuration for some buckets. For in-depth information on Riak’s storage backends, see our documentation on choosing a backend.

How do I model data using Riak’s key/value design?

Riak uses a key/value design to store data. Key/value pairs comprise objects, which are stored in buckets. Buckets are flat namespaces with some configurable properties, such as the replication factor. One frequent question we get is how to build applications using the key/value scheme. The unique needs of your application should be taken into account when structuring it, but here are some common approaches to typical use cases. Note that Riak is content-agnostic, so values can be any content type.

  • Session – Key: User/Session ID; Value: Session Data
  • Content – Key: Title, Integer; Value: Document, Image, Post, Video, Text, JSON/HTML, etc.
  • Advertising – Key: Campaign ID; Value: Ad Content
  • Logs – Key: Date; Value: Log File
  • Sensor – Key: Date, Date/Time; Value: Sensor Updates
  • User Data – Key: Login, Email, UUID; Value: User Attributes
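
As a concrete instance of the first entry above, session storage reduces to a put and a get keyed by session ID. A minimal sketch with the Erlang client (riakc); names and values are illustrative:

```erlang
%% Sketch: session data keyed directly by session ID.
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087).
Session = riakc_obj:new(<<"sessions">>, <<"sess-1f2e3d">>,
                        <<"{\"user\":\"ada\",\"cart\":[]}">>, "application/json").
ok = riakc_pb_socket:put(Pid, Session).
{ok, Got} = riakc_pb_socket:get(Pid, <<"sessions">>, <<"sess-1f2e3d">>).
```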

For more comprehensive information on building applications with Riak’s key/value design, view the use cases section of our documentation.

What other options, besides strict key/value access, are there for querying Riak?

Most operations done with Riak will be reading and writing key/value pairs to Riak. However, Riak exposes several other features for searching and accessing data: MapReduce, full-text search, and secondary indexing.

MapReduce provides non-primary-key-based querying that divides work across the Riak distributed database. It is useful for tasks such as filtering by tags, counting words, extracting links, analyzing log files, and other aggregation work. Riak provides both JavaScript and Erlang MapReduce support. Jobs written in Erlang are generally more performant. You can find more details about Riak MapReduce here.
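
As an illustration, here is a whole-bucket key count through the Erlang client (riakc), using one of Riak’s built-in Erlang reduce functions. Whole-bucket inputs are exactly the kind of query to use sparingly, and the result shape shown is typical but can vary by client version:

```erlang
%% Sketch: count every object in a bucket with a built-in Erlang reduce.
%% Whole-bucket inputs are expensive; reserve them for small or offline jobs.
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087).
{ok, [{_Phase, [Count]}]} =
    riakc_pb_socket:mapred(Pid, <<"logs">>,
        [{reduce, {modfun, riak_kv_mapreduce, reduce_count_inputs},
          none, true}]).
```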

Riak also provides Riak Search, a full-text search engine that indexes documents on write and provides an easy, robust query language and SOLR-like API. Riak Search is ideal for indexing content like posts, user bios, articles, and other documents, as well as indexing JSON data. For more information, see the documentation on Riak Search.

Secondary indexing allows you to tag objects in Riak with one or more queryable values. These “tags” can then be queried by exact or range value for integers and strings. Secondary indexing is great for simple tagging and searching Riak objects for additional attributes. Check out more details here.
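
A sketch of tagging and querying with the Erlang client (riakc): index names carry a type suffix (_bin for strings, _int for integers). The metadata helpers shown exist in newer client versions, and the exact shape of the query result varies by version:

```erlang
%% Sketch: tag an object with a secondary index, then query by exact match.
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087).
Obj0 = riakc_obj:new(<<"players">>, <<"ada">>, <<"...">>, "application/json").
MD = riakc_obj:set_secondary_index(riakc_obj:get_update_metadata(Obj0),
                                   [{{binary_index, "team"}, [<<"red">>]}]).
ok = riakc_pb_socket:put(Pid, riakc_obj:update_metadata(Obj0, MD)).
%% Returns the keys of every object tagged team_bin = "red".
{ok, Results} = riakc_pb_socket:get_index_eq(Pid, <<"players">>,
                                             {binary_index, "team"}, <<"red">>).
```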

How does Riak differ from other databases?

We often get asked how Riak is different from other databases and other technologies. While an in-depth analysis is outside the scope of this post, the below should point you in the right direction.

Riak is often used by applications and companies with a primary background in relational databases, such as MySQL. Most people who move from a relational database to Riak cite a few reasons. For one, Riak’s masterless, fault-tolerant, read/write available design makes it a better fit for data that must be highly available and resilient to failure scenarios. Second, Riak’s operational profile and use of consistent hashing means data is automatically redistributed as you add machines, avoiding hot spots in the database and manual resharding efforts. Riak is also chosen over relational databases for the multi-datacenter capabilities provided in Riak Enterprise. A more detailed look at the difference between Riak and traditional databases and how to make the switch can be found in this whitepaper, From Relational to Riak.

A more detailed look at the technical differences between Riak and other NoSQL databases can be found in the comparisons section of our documentation, which covers databases such as MongoDB, Couchbase, Neo4j, Cassandra, and others.

Ready to get started? You can download Riak here. For more in-depth information about Riak, we also offer Riak Workshops in New York and San Francisco. Learn more here.

Basho

Building Gaming Applications and Services with Riak

March 13, 2013

For a complete overview, download the whitepaper, “Gaming on Riak: A Technical Introduction.” To see how other gaming companies are using Riak, visit us at the Game Developers Conference at Booth #202!

As discussed in our previous post, “Gaming on Riak: A Brief Overview and User Case Studies,” Riak can provide a number of advantages for gaming platforms. With content-agnostic storage, an HTTP API, many client libraries, and a simple key/value data model, Riak is a flexible data store that can be used for a variety of different use cases in the gaming industry. This post looks at some common examples and how to start building them in Riak.

Use Cases

Player and Session Data: Riak can serve and store key player and session data with predictable low latency, and ensures it is available even in the event of node failure and network partition. This data may include user and profile information, game performance, statistics and rankings, and more. In Riak, all objects are stored on disk as binaries, providing flexible storage for many content types. Since Riak is schema-less, applications can evolve without changing an underlying schema, providing agility with growth and change.

Social Information: Riak can be used for social content such as social graph information, player profiles and relationships, social authentication accounts, and other types of social gaming data.

Content: Riak is often used to store text, documents, images, videos and other assets that power gaming experiences. This data often needs to be highly available and able to scale quickly to attract and keep users.

Global Data Locality: Gaming requires a low-latency experience, no matter where the players are located. Riak Enterprise’s multi-datacenter replication feature means data can be served to global users quickly.

Data Model

Below are some common approaches to structuring gaming data with Riak’s key/value model:

Riak offers robust additional functionality on top of the fundamental key/value model. For more information on these options as well as how to implement them, their architecture, and their limitations, check out the documentation on searching and accessing data in Riak.

Riak Search
Riak Search is a distributed, full-text search engine. It provides support for various MIME types & analyzers, and robust querying including exact matches, wildcards, range queries, proximity searches, and more.
Possible Use Cases: Searching player and game information.

Secondary Indexing
Secondary Indexing (2i) gives developers the ability, at write time, to tag an object stored in Riak with one or more queryable values. Indexes can be either integers or strings and can be queried by either exact matches or ranges of an index.
Possible Use Cases: Tagging player information with user relationships, general attributes, or other metadata.

MapReduce
Developers can leverage MapReduce for analytic and aggregation tasks. Riak offers support for both JavaScript and Erlang MapReduce.
Possible Use Cases: Filtering game data by tag, counting items, and extracting links to related data.

To learn more about how your gaming platform can benefit from Riak, download “Gaming on Riak: A Technical Introduction.” For more information about Riak, sign up for our webcast on Thursday, March 14.

Basho

Introducing Riak 1.3: Active Anti-Entropy, a New Look for Riak Control, and Faster Multi-Datacenter Replication

February 21, 2013

Today we are excited to announce the latest version of Riak. Here is a summary of the major enhancements delivered in Riak 1.3:

  • Introduced Active Anti-Entropy. Riak now has active anti-entropy. In distributed systems, inconsistencies can arise between replicas due to failure modes, concurrent updates, and physical data loss or corruption. Pre-1.3 Riak already had several features for repairing this “entropy”, but they all required some form of user intervention. Riak 1.3 introduces automatic, self-healing properties that repair entropy on an ongoing basis.
  • Improved Riak Enterprise’s multi-datacenter replication performance. New advanced mode for multi-datacenter replication capabilities, with better performance, more TCP connections and easier configuration. Read more in this write up from GigaOM.
  • Improved graphical user experience. Riak Control, the user interface for managing and monitoring Riak, has a brand new look.
  • Expanded IPv6 support. All of Riak’s interfaces now support IPv6.
  • Improved MapReduce. Riak MapReduce has improved back-pressure to reduce the risk of overwhelming endpoint processes during large tasks.
  • Simplified log management. Riak can now optionally send log messages to syslog.

Ready to get started or upgrade? Download the new release here, check out the official release notes, or read on for more details. Documentation for all products and releases is available on the documentation site. For an introduction to Riak and what’s new in Riak 1.3, sign up for our webcast on Thursday, March 7.

More on What’s in Riak 1.3

Active Anti-Entropy
A key feature of Riak is its ability to regenerate lost or corrupted data from replicated data stored on other nodes. Prior to this release, Riak provided two methods to repair data:

  • Read Repair: Riak compares the replies from all replicas during a read request, repairing any replica that is divergent or missing data. (K/V data only)
  • Repair Command via Riak Console: Introduced in Riak 1.2, the repair command enables users to trigger a repair of a specific partition. The partition is rebuilt based on a subset of data stored on adjacent nodes in the Riak ring. All data is rebuilt, not just missing or divergent data. (K/V and Search data)

Riak 1.3 introduces active anti-entropy, a continuous background process that compares and repairs any divergent, missing, or corrupted replicas (K/V data only). Unlike read repair, which is only triggered when data is read, the active anti-entropy system ensures the integrity of all data stored in Riak. This is particularly useful in clusters containing “cold data”: data that may not be read for long periods of time, potentially years. Furthermore, unlike the repair command, active anti-entropy is an automatic process that requires no user intervention; it is enabled by default in Riak 1.3.

Riak’s active anti-entropy feature is based on hash tree exchange, which enables differences between replicas to be determined with minimal exchange of information. Specifically, the amount of information exchanged in the process is proportional to the differences between two replicas, not the amount of data that they contain. Approximately the same amount of information is exchanged when there are 10 differing keys out of 1 million keys as when there are 10 differing keys out of 10 billion keys. This enables Riak to provide continuous data protection regardless of cluster size.

Additionally, Riak uses persistent, on-disk hash trees rather than purely in-memory trees, a key difference from similar implementations in other products. This allows Riak to maintain anti-entropy information for billions of keys with minimal additional memory usage, as well as allows Riak nodes to be restarted without losing any anti-entropy information. Furthermore, Riak maintains the hash trees in real time, updating the tree as new write requests come in. This reduces the time it takes Riak to detect and repair missing/divergent replicas. For added protection, Riak periodically (default: once a week) clears and regenerates all hash trees from the on-disk K/V data. This enables Riak to detect silent data corruption to the on-disk data arising from bad disks, faulty hardware components, etc.
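
The idea can be pictured with a toy Merkle-tree comparison that descends only into subtrees whose hashes differ. A schematic sketch, emphatically not Riak’s actual implementation:

```erlang
-module(toy_exchange).
-export([diff/2]).

%% Toy Merkle comparison: a node is {Hash, Children} and a leaf is
%% {Hash, {key, Key}}. Equal hashes prune whole subtrees, so the work
%% tracks the number of differences rather than the total key count.
%% Both trees are assumed to have the same shape.
diff({Hash, _}, {Hash, _}) ->
    [];                                          %% identical: prune
diff({_, {key, Key}}, {_, {key, Key}}) ->
    [Key];                                       %% differing leaf: repair Key
diff({_, KidsA}, {_, KidsB}) when is_list(KidsA), is_list(KidsB) ->
    lists:append([diff(A, B) || {A, B} <- lists:zip(KidsA, KidsB)]).
```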

New Look for Riak Control
Riak Control is a UI for managing and monitoring your Riak cluster. Riak Control lets you start and restart Riak nodes, view a “health check” for your cluster, see all nodes and their current status, and have visibility into their partitions and services. Riak Control now has a brand new look and feel. Check out the Riak Control Github page to get up and running.

Expanded IPv6 Support
While Riak’s HTTP interface has always supported IPv6, not all of its interfaces have been as current. In Riak 1.3, the protocol buffers interfaces can now listen on IPv6 or IPv4 addresses. Riak handoff (which is responsible for data transfer when nodes are added or removed, and for handing off update responsibilities when nodes fail) also supports IPv6. It should also be noted that community member Tom Lanyon started the work on this feature. Thanks, Tom!

Improved Backpressure in Riak MapReduce
Riak has JavaScript and Erlang MapReduce for performing aggregation and analytics tasks. Backpressure is an important aspect of the MapReduce system, keeping processes from being overwhelmed and memory consumption from getting out of control. In Riak 1.3, tunable backpressure is extended to the MapReduce sink to prevent these types of problems at endpoint processes.

Riak Enterprise: Advanced Multi-Datacenter Replication Capabilities
With hundreds of companies using Riak Enterprise, a commercial extension of Riak, we’ve been lucky to work with many teams pushing the limits of multi-datacenter replication performance and resiliency. We’ve learned a lot and are excited to announce these capabilities are now available in advanced mode.

  • Previously, multi-datacenter replication had one TCP connection over which data was streamed from one cluster to another. This could create a performance bottleneck, especially when run on nodes constrained by per-instance bandwidth limits, such as in a cloud environment. In the new version of multi-datacenter replication, multiple concurrent TCP connections (approximately one per physical node) and processes are used between sites.
  • Configuration of multi-datacenter replication is easier. Use a shell command to name your clusters, then connect both clusters using a simple ip:port combination.
  • Better per-connection statistics for both full-sync and real-time modes.
  • New ability to tweak full-sync workers per node and per cluster, allowing customers to dial in performance.

The new replication improvements are already used in production by customers and yielding significant performance improvements. For now, the new replication technology is available in advanced mode: it’s optional to turn on. It currently doesn’t have all of the features of the default mode – including SSL, NAT support and full-sync scheduling. Both default and advanced modes are available in the 1.3 release and function independently. In the future, “advanced mode” will become the default.

For more details about multi-datacenter replication, download our whitepaper, “Multi-Datacenter Replication: A Technical Overview.”

Basho

Building Retail and eCommerce Services with Riak

January 31, 2013

This is the second in a series of blog posts covering Riak for retail and eCommerce platforms. To learn more, join our “Retail on Riak” webcast on Friday, February 8th or download the “Riak for Retail” whitepaper.

In our last post, we looked at three Riak users in the eCommerce/retail space. In this post, we will look at some common use cases for Riak and how to start building them with Riak’s key/value model and querying features.

Use Cases

  • Shopping Carts: Riak’s focus on availability makes it attractive to retailers offering shopping carts and other “buy now” functionality. If the shopping cart is unavailable, loses product additions, or responds slowly to users, it has a direct impact on revenue and user trust.
  • Product Catalogs: Retailers need to store anywhere from thousands to tens of thousands of inventory items and associated information – such as photos, descriptions, prices, and category information. Riak’s flexible, fast storage makes it a good fit for this type of data.
  • User Information: As mobile, web, and multi-channel shopping become more social and personalized, retailers have to manage increasing amounts of user information. Riak scales efficiently to meet increased data and traffic needs and ensures user data is always available for online shopping experiences.
  • Session Data: Riak provides a highly reliable platform for session storage. User/session IDs are usually stored in cookies or otherwise known at lookup time, a good fit for Riak’s key/value model.

Data Modeling

In Riak, objects are composed of key/value pairs, which are stored in flat namespaces called “buckets.” Riak is content-type agnostic, and stores all objects on disk as binaries, giving retailers lots of flexibility to store anything they want. Here are some common approaches to modeling the data and services discussed above in Riak:

Querying

Riak provides several features for querying data:

Riak Search: Riak Search is a distributed, full-text search engine. It provides support for various MIME types & analyzers, and robust querying.
Possible Use Cases: Searching product information or product descriptions.

Secondary Indexing: Secondary Indexing (2i) gives developers the ability, at write time, to tag an object stored in Riak with one or more values, queryable by exact matches or ranges of an index.
Possible Use Cases: Tagging products with categories, special promotion identifiers, date ranges, price or other metadata.

MapReduce: Riak offers MapReduce for analytic and aggregation tasks with support for JavaScript and Erlang.
Possible Use Cases: Filtering product information by tag, counting items, and extracting links to related products.

Check out our docs for more information on building applications and services with Riak.

For more details and examples of common Riak use cases, register for our “Retail on Riak” webcast on February 8th or download the “Riak for Retail” whitepaper.

Basho

A Fresh Start For riak-js

November 13, 2012

Today I’m happy to announce a new release of riak-js, the Riak client for Node.js. This release breathes new life back into riak-js. For various reasons, development on riak-js had been dormant for quite a while, but that has changed for the better and we’re committed to making it a viable client library for production applications.

Ancient History

riak-js has been around for quite a while. The first commit by its original author, Francisco Treacy, dates back to March 2010. The original implementation was written in JavaScript and ran on Node 0.1. Subsequent versions were rewritten in CoffeeScript, until Francisco wanted a clean start. CoffeeScript is a great language, but shipping compiled code to the user meant increased headaches during debugging and a slightly increased entrance barrier for the JavaScript community.

So he started a complete rewrite in JavaScript more than a year ago. It progressed quite nicely but hit a wall, and Francisco went looking for a new maintainer for the library. Thankfully the guys from mostlyserious picked up the torch, and eventually I joined them in the effort to finalize the library into a new release.

During the transition, riak-js got a new home as well, and can now be found on mostlyserious/riak-js. Please make sure to send all issues and pull requests to this repository.

The Present

Version 0.9.0 of riak-js was shipped today and can be installed from npmjs.org:

While the new release is mostly backwards-compatible, it does come with some changes that I deemed worthwhile to make before hitting the big 1.0.

Most notably, the functions for using MapReduce and Riak Search have moved into their own namespaces. That brings riak-js more in line with other client libraries, keeping both features a bit more separate from normal client operations.

To run MapReduce, you now use something like the following example:

The same goes for Riak Search, which is now fully supported, including adding, removing and querying documents directly from a search index:

There are other minor changes, e.g. accessing bucket properties:

To get a good overview of the current API of riak-js, check out the documentation.

Back To The Future

Development on future versions has already begun, and we still have a good list of things to work on for future releases.

Most notably, Protocol Buffers support is going to return. For simplicity reasons, it was removed from the JavaScript implementation. It will be back!

I’ve also started adding instrumentation to the HTTP API operations for easier tracking of metrics and logging. But more on that in another blog post.

If there’s anything you’d like to see in a future version, open an issue, send a pull request or start a discussion on the Riak mailing list!

Mathias

Building Apps on Riak – Content, Sessions, User Data, Ads and Other Use Cases

October 30, 2012

How do I build my application on Riak? This is one of the first questions we get from people new to Riak. Whether you are new to the key/value model, switching from a relational database, or building a new application on Riak, we’ve got a new section of the docs to help you get started.

Learn about data models for common application types – including high read/write use cases (session storage, ad platforms, log and sensor data) and apps that require relationship modeling (content-serving applications and user accounts, settings, and events). While not meant to be a prescriptive guide, this section walks through common simple and complex approaches to building apps using Riak’s key/value design and features such as MapReduce, search and secondary indexing. You can also check out the stories of users like Voxer, Clipboard and Yammer to learn more about how they built their apps on Riak.

This is just the start of a bigger effort to help users better understand how to build amazing applications on Riak. We will be adding code examples, other application types, additional considerations, and more stories from the user community. We’d love to add your contributions, so please submit a pull request or file an issue on our Github repo if you have anything to add or anything you’d like to know more about.

Basho Team

Slides – Searching and Accessing Data in Riak

September 24, 2012

Last week we aired a live broadcast from our sunny San Francisco offices on searching and accessing data in Riak. This is one of the top things we get questions about – especially from people moving from databases that use other data models or people building new applications on top of Riak.

Slides are below – unfortunately we had some audio quality issues with the video and need to rerecord it for upload. The webinar covers Riak’s full-text search, object tagging, and MapReduce capabilities, including:

  • use cases and features
  • when NOT to use
  • query examples
  • user scenarios
  • high-level architecture and configuration
  • hybrid architectures

Check out the slides below and make sure to read our docs on the subject.

SLIDES

In the talk, we mention a blog post from Riak user Clipboard on how they optimized Riak Search for their application, which is an awesome way to save and share online content. You can read that blog post here.

Protobuffs in Riak 1.2

July 18, 2012

You might remember that back in April, we sent around a survey to get input about what features developers use and want in Riak clients. All in all, we had about 87 developers respond to the survey.

One of the questions in that survey — and the one that was the most interesting to me — asked the respondent to rank some potential features for the roadmap. At the top of that list in the results was to support Secondary Index (2I) and Riak Search queries natively from the Protocol Buffers (Protobuffs) interface. You could already query them by sending a MapReduce request, but the additional step was confusing for some, and slow for others. I set out to make these features happen for the Riak 1.2 release.

Coupling challenges

Originally, the Protobuffs interface was created in an effort to satisfy a customer’s specific performance issue with HTTP, back around Riak version 0.10 or so. It seemed to work well for others, too, and so it got merged into the mainline. From that point until 1.0, not much was done with it. In Riak 1.0, it got a slew of new options — especially enhancements to Key-Value operations like get, put, and delete — that brought it closer to feature-parity with the HTTP interface.

Now, simply adding 2I queries to the existing system would have been straightforward, but search queries would not have been so. Why?

  • While the HTTP interface of Riak has always been built atop Webmachine, making it easy to
    add new resources as needed, the Protobuffs components were part of riak_kv. In fact, the Protobuffs interface was created while riak_search was still in its infancy, and when we had little idea what its interface would look like. Adding a coupling back the other direction (from riak_kv to riak_search) might just make the problem worse.
  • The riak-erlang-client was a dependency of riak_kv so that they could share the riakclient.proto file that contained all of the protocol message definitions. This made the Riak codebase potentially brittle to changes in the client library and made it necessary to copy the riakclient.proto file to our other clients that generate code from it.
  • We were using an antiquated version of the erlang_protobuffs library that we had forked and not kept up-to-date. The new maintainer had added features like extensions that we would like to use in the future. If I recall correctly, our version didn’t even properly support enumerations.

Refactoring

With those problems in mind and with the help of a few of my fellow Engineers, I set out to refactor the entire thing. Here’s what we came up with.

First, we separated the connection management from the message processing. This is a bit like how Webmachine works, where the accepting (mochiweb) and dispatching (webmachine) of an incoming HTTP message is separate from processing the message (your resource module and the decision graph). The result of our refactoring is the new riak_api OTP
application. It consists of a TCP listener, server processes that are spawned for each connection, and a registration utility for defining your own message handlers which are called “services”. Here’s how riak_kv registers its services:

```erlang
riak_api_pb_service:register([{riak_kv_pb_object,  3,  6},  %% ClientID stuff
                              {riak_kv_pb_object,  9, 14},  %% Object requests
                              {riak_kv_pb_bucket, 15, 22},  %% Bucket requests
                              {riak_kv_pb_mapred, 23, 24},  %% MapReduce requests
                              {riak_kv_pb_index,  25, 26}   %% Secondary index requests
                             ]).
```

Each service, represented as a module that implements the riak_api_pb_service behaviour, specifies a range of message codes it can handle. When an incoming message with a registered message code is received, it is dispatched to the corresponding service module, which can then do some processing and decide what messages to send back to the client.
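
For orientation, here is a schematic skeleton of such a service module; it assumes the behaviour’s documented callback set (init/0, decode/2, encode/1, process/2, process_stream/3), and my_pb_codec plus the request/response terms are hypothetical:

```erlang
-module(my_pb_service).
-behaviour(riak_api_pb_service).
-export([init/0, decode/2, encode/1, process/2, process_stream/3]).

%% Schematic only: a real service decodes its registered message codes
%% into protocol buffers records and encodes records back to the wire.
init() ->
    no_state.                                    %% per-connection state

decode(_Code, Bin) ->
    {ok, {my_request, Bin}}.                     %% wire -> request term

encode(Message) ->
    {ok, my_pb_codec:encode(Message)}.           %% hypothetical codec module

process({my_request, _Bin}, State) ->
    {reply, my_response, State}.                 %% one response message

process_stream(_PipeMsg, _ReqId, State) ->
    {ignore, State}.                             %% not a streaming service
```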

Second, we separated the Protobuffs message definitions from the Erlang client library. We put the .proto file in a new library application called riak_pb, and actually split it out into several files, grouped by the component of the server they represent; this means there’s a riak.proto, riak_kv.proto, and riak_search.proto. In addition to removing the coupling between the Erlang client and the server, we now have a project whose only responsibility is to describe the messages of the protocol. It’s like the equivalent of an RFC, but in code! In the near future we will have build targets in the project that let us generate Java or Python shims from the included messages and that we can distribute as standalone .jar and .egg files.

Third, we merged upstream changes from the new erlang_protobuffs maintainer and made some updates of our own. In addition to the features like extensions, the newer version has a more complete test suite. Our own updates fixed some bugs and edge cases in the library so that we could improve the overall experience for users. For example, when encountering an unknown message field, the TCP connection will no longer close because of a decoding error; instead, the unknown field will just be ignored.

New features

Whew, that was a lot of work just to get to good stuff! With the updated code structure and a plan with how to move forward, we added two new services, one in riak_kv (supporting native 2I) and one in riak_search (supporting native search-index queries), and four new messages to riak_pb to support those services. We decided not to expose the “add to index” or “delete from index” features in riak_search because we want to take it in a direction that focuses on indexing KV data rather than maintaining a separate index-management interface. If you’re already using the “search KV hook” to index your data, you’ll be fine.

Client-side support for these new requests and responses has already landed in the Ruby client and will soon be landing in Java, Erlang, and Python. You can track support for the new features on our updated Client Libraries wiki page.

Roadmap

Those two new client-facing features are great, but the survey showed us a lot more about what you want and need from Riak’s interfaces. For future releases we’ll be investigating how to improve the Protobuffs interface’s error messages and support for bucket properties, how to expose bulk or asynchronous operations, and much more.

Keep using Riak and sending us great feedback!

Sean