Tag Archives: Riak Search

Hangouts with Basho

January 29, 2014

On Fridays, Basho hosts a Hangout to discuss various topics related to Riak and distributed systems. While Basho evangelists and engineers lead these live Hangouts, they also bring in experts from various other companies, including Kyle Kingsbury (Factual), Jeremiah Peschka (Brent Ozar Unlimited), and Stuart Halloway (Datomic).

If you haven’t attended a Hangout, we have recorded them all and they are available on the Basho Technologies YouTube channel. You can also watch each below.

Data Types and Search in Riak 2.0

Featuring Mark Phillips (Director of Community, Basho), Sean Cribbs (Engineer, Basho), Brett Hazen (Engineer, Basho), and Luke Bakken (Client Services Engineer, Basho)

Bucket Types and Configuration

Featuring Tom Santero (Technical Evangelist, Basho), Joe DeVivo (Engineer, Basho), and Jordan West (Engineer, Basho)

Riak 2.0: Security and Conflict Resolution

Featuring John Daily (Technical Evangelist, Basho), Andrew Thompson (Engineer, Basho), Justin Sheehy (CTO, Basho), and Kyle Kingsbury (Factual)

Fun with Java and C Clients

Featuring Seth Thomas (Technical Evangelist, Basho), Brett Hazen (Engineer, Basho), and Brian Roach (Engineer, Basho)

Property Based Testing

Featuring Tom Santero (Technical Evangelist, Basho) and Reid Draper (Engineer, Basho)

Datomic and Riak

Featuring Hector Castro (Technical Evangelist, Basho), Dmitri Zagidulin (Professional Services, Basho), and Stuart Halloway (Datomic)

CorrugatedIron

Featuring John Daily (Technical Evangelist, Basho), David Rusek (Engineer, Basho), and Jeremiah Peschka (Brent Ozar Unlimited)

A Look Back

Featuring John Daily (Technical Evangelist, Basho), Hector Castro (Technical Evangelist, Basho), Andy Gross (Chief Architect, Basho), and Mark Phillips (Director of Community, Basho)

Hangouts take place on Fridays at 11am PT/2pm ET. If you have any topics you’d like to see featured, let us know on the Riak Mailing List.

Basho

Riak Development Anti-Patterns

January 7, 2014

Writing an application that can take full advantage of Riak’s robust scaling properties requires a different way of looking at data storage and retrieval. Developers who bring a relational mindset to Riak may create applications that work well with a small data set but start to show strain in production, particularly as the cluster grows.

This post looks at some of those common conceptual challenges.

Dynamic Querying

Riak offers query features such as secondary indexes (2i), MapReduce, and full-text search, but throwing a large quantity of data into Riak and expecting those tools to find whatever you need is setting yourself (and Riak) up to fail. Performance will be poor, especially as you scale.

Reads and writes in Riak should be as fast with ten billion values in storage as with ten thousand.

Key/value operations seem primitive (and they are), but you’ll find they are flexible, scalable, and very fast (and predictably so).

Treat 2i and friends as tools to be applied judiciously, design the main functionality of your application as if they don’t exist, and your software will continue to work at blazing speeds when you have petabytes of data stored across dozens of servers.
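
To make this concrete, here is what that core key/value workflow looks like with Basho’s Erlang client (riakc). This is a minimal sketch, assuming a node listening on the default Protocol Buffers port; the bucket, key, and value are illustrative:

```erlang
%% Connect to a Riak node over Protocol Buffers.
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),

%% Store a value under a bucket/key chosen by the application.
Obj = riakc_obj:new(<<"users">>, <<"user-1001">>,
                    <<"{\"name\":\"Ada\",\"plan\":\"pro\"}">>,
                    "application/json"),
ok = riakc_pb_socket:put(Pid, Obj),

%% Fetch it back by key: one direct, predictable read.
{ok, Fetched} = riakc_pb_socket:get(Pid, <<"users">>, <<"user-1001">>),
Value = riakc_obj:get_value(Fetched).
```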

Normalization

Normalizing data is generally a useful approach in a relational database, but unlikely to lead to happy results with Riak.

Riak lacks foreign key constraints and join operations, two vital parts of the normalization story, so reconstructing a single record from multiple objects would involve multiple read requests; certainly possible and fast enough on a small scale, but not ideal for larger requests.

Instead, imagine the performance of your application if most of your requests were a single, trivial read. Preparing and storing the answers to queries you’re going to ask later is a best practice for Riak.

Ducking Conflict Resolution

One of the first hurdles Basho faced when releasing Riak was educating developers on the complexities of eventual consistency and the need to intelligently resolve data conflicts.

Because Riak is optimized to remain available even when servers are offline or disconnected from the cluster due to network failures, it is not uncommon for two servers to have different versions of a piece of data.

The simplest approach to coping with this is to allow Riak to choose a winner based on timestamps. Riak can do this more effectively if developers follow Basho’s guidance on sending updates with vector clock metadata to help track causal history. However, concurrent updates often cannot be automatically resolved via vector clocks, and trusting server clocks to determine which write was the last to arrive is a terrible conflict resolution method.

Even if your server clocks are magically always in sync, are your business needs well-served by blindly applying the most recent update? Some databases have no alternative but to handle it that way, but we think you deserve better.

Riak 2.0, when installed on new clusters, will default to retaining conflicts and requiring the application to resolve them, but we’re also providing replicated data types to automate conflict resolution on the servers.

If you want to minimize the need for conflict resolution, modeling with as much immutable data as possible is a big win.
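
For illustration, application-side resolution looks roughly like this with the Erlang client (riakc). This is a sketch: merge_carts/1 stands in for hypothetical application-level merge logic, and the bucket and key are illustrative:

```erlang
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),

%% Fetch an object that may hold concurrent versions (siblings).
{ok, Obj} = riakc_pb_socket:get(Pid, <<"carts">>, <<"session-12345">>),
case riakc_obj:value_count(Obj) of
    1 ->
        %% No conflict: a single value came back.
        riakc_obj:get_value(Obj);
    _ ->
        %% Conflict: merge the sibling values with application logic
        %% (merge_carts/1 is hypothetical), then write the merged
        %% value back on the fetched object so its causal context
        %% is preserved.
        Values = [V || {_MD, V} <- riakc_obj:get_contents(Obj)],
        Merged = merge_carts(Values),
        ok = riakc_pb_socket:put(Pid, riakc_obj:update_value(Obj, Merged)),
        Merged
end.
```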

Mutability

For years, functional programmers have been singing the praises of immutable data, and it confers significant advantages when using a distributed data store like Riak.

Most obviously, conflict resolution is dramatically simplified when objects are never updated.

Even in the world of single-server databases, updating records in place carries costs. Most databases lose all sense of history when data is updated, and it’s entirely possible for two different clients to overwrite the same field in rapid succession, leading to unexpected results.

Some data is always going to be mutable, but thinking about the alternative can lead to better design.

SELECT * FROM <table>

A perfectly natural response when first encountering a populated database is to see what’s in it. In a relational database, you can easily retrieve a list of tables and start browsing their records.

As it turns out, this is a terrible idea in Riak.

Riak is optimized for unstructured, opaque data; however, it is not designed to allow for trivial retrieval of lists of buckets (very loosely analogous to tables) and keys.

Doing so can put a great deal of stress on a large cluster and can significantly impact performance.

It’s a rather unusual idea for someone coming from a relational mindset, but being able to algorithmically determine the key that you need for the data you want to retrieve is a major part of the Riak application story.
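
A sketch of the idea with the Erlang client (riakc): instead of listing keys, derive the key from identifiers the application already knows. The naming scheme here is purely illustrative:

```erlang
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),

%% Build the key from values already known at lookup time, turning
%% every lookup into a single direct GET.
UserId = <<"42">>,
Day    = <<"2014-01-07">>,
Key    = <<"user-", UserId/binary, "-orders-", Day/binary>>,
{ok, Summary} = riakc_pb_socket:get(Pid, <<"order_summaries">>, Key).
```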

Large Objects

Because Riak sends multiple copies of your data around the network for every request, values that are too large can clog the pipes, so to speak, causing significant latency problems.

Basho generally recommends 1-4MB objects as a soft cap; larger sizes are possible with careful tuning, however.

For significantly larger objects, Riak CS offers an Amazon S3-compatible (and also OpenStack Swift-compatible) key/value object store that uses Riak under the hood.

Running a Single Server

This is more of an operations anti-pattern, but it is a common misunderstanding of Riak’s architecture.

It is quite common to install Riak in a development environment using its devrel build target, which creates five full Riak stacks (including Erlang virtual machines) to run on one server to simulate a cluster.

However, running Riak on a single server for benchmarking or production use is counterproductive, regardless of whether you have one stack or five on the box.

It is possible to argue that Riak is more of a database coordination platform than a database itself. It uses Bitcask or LevelDB to persist data to disk, but more importantly, it commonly uses at least 64 such embedded databases in a cluster.

Needless to say, if you run 64 databases simultaneously on a single filesystem you are risking significant I/O and CPU contention unless the environment is carefully tuned (and has some pretty fast disks).

Perhaps more importantly, Riak’s core design goal, its raison d’être, is high availability via data redundancy and related mechanisms. Writing three copies of all your data to a single server is mostly pointless, both contributing to resource contention and throwing away Riak’s ability to survive server failure.

So, Now What?

As always, we recommend visiting Basho’s docs website for more details on how to build and run Riak, and many of our customers have given presentations on their Riak use cases, including data modeling.

Also, keep an eye on the Basho blog where we provide high-level overviews like this of Riak and the larger non-relational database world.

For a detailed analysis of your needs and modeling options, contact Basho to speak with our professional services team.

John Daily

RICON West Videos: Riak Search 2.0

December 17, 2013

In addition to Riak Data Types, there were a number of other presentations about Riak 2.0 features at RICON West. With the Technical Preview of Riak 2.0, we also announced a completely redesigned Riak Search.

Riak is a straight key/value data store and all objects are stored on disk as binaries. It is content agnostic, meaning you can store any type of data as the value in Riak. To improve the usability and functionality of Riak, we offer multiple querying options, including Riak Search, Secondary Indexing, and MapReduce. Riak Search is a full-text search engine that allows Riak developers to index the contents of stored values. While Riak Search offered much-needed functionality, it had its flaws.

In Riak 2.0, Riak Search received a complete overhaul. Riak Search 2.0 leverages the Apache Solr full-text document indexing engine directly. Riak users now get the power of Solr with the availability and scalability of Riak. This upgrade also supports the Solr client query APIs, which enables integration with existing software solutions.

Eric Redmond is one of the Basho engineers who works on Riak Search. At RICON, he presented “Riak Search 2.0,” which walks through what’s new with Riak Search and why you’d want to use it. He also provides some impressive demos that show off the power of Solr and Riak. His full talk is below.

For more information on Riak Search 2.0, check out these resources on GitHub.

To watch all of the sessions from RICON West 2013, visit the Basho Technologies YouTube channel.

Basho

Relational to Riak – Tradeoffs

November 18, 2013

This series of blog posts will discuss how Riak differs from traditional relational databases. For more information about any of the points discussed, download our technical overview, “From Relational to Riak.” The previous post in the series discussed High Availability and Cost of Scale.


Eventual Consistency

In order to provide high availability, which is a cornerstone of Riak’s value proposition, the database stores several copies of each key/value pair.

This availability requirement leads to a fundamental tradeoff: in order to continue to serve requests in the presence of failure, we do not force all data in the cluster to stay in sync. Riak will allow writes and reads no matter how many servers (and their stored replicas) are offline or otherwise unreachable.

(Incidentally, this lack of strong coordination has another consequence beyond high availability: Riak is a very, very fast database.)

Riak does provide both active and passive self-healing mechanisms to minimize the window of time during which two servers may have different versions of data.

The concept of eventual consistency may seem unfamiliar, but if you’ve ever implemented a cache or used DNS, those are common examples of the idea. In a large enough system, it’s effectively the default state of all data.

However, with the forthcoming release of Riak 2.0, operators will be able to designate selected pieces of data as requiring coordination, maintaining strong consistency at the expense of high availability. Writing such data will be slower and subject to failure if too many servers are unreachable, but the overall robust architecture of Riak will still provide a fast, highly available solution.

Data Modeling

Riak stores data using a simple key/value model, which offers developers tremendous flexibility to define access models that suit their applications. It is also content-agnostic, so developers can store arbitrary data in any convenient format.

Instead of forcing application-specific data structures to be mapped into (and out of) a relational database, they can simply be serialized and dropped directly into Riak. For records that will be frequently updated, if some of the fields are immutable and some aren’t, we recommend keeping the immutable data in one key/value pair and organizing the rest into one or more objects based on update patterns.

Relational databases are ingrained habits for many of us, but moving beyond them can be liberating. Further information about data modeling, including sample configurations, is available in the Use Cases section of the documentation.

Tradeoffs

One tradeoff with this simpler data model is that there is no SQL or SQL-like language with which to query the data.

To achieve optimal performance, it is advisable to take advantage of the flexibility of the key/value model to define simple retrieval patterns. In other words, determine the most useful queries and write the results of those queries as the data is being processed.
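
As a sketch of that pattern with the Erlang client (riakc): when a record is written, also write the precomputed answer to a question you know you will ask later. Bucket names, keys, and the JSON payloads are illustrative:

```erlang
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),

%% Write the source record.
OrderJson = <<"{\"user\":\"42\",\"total\":59.90}">>,
Order = riakc_obj:new(<<"orders">>, <<"order-9001">>, OrderJson,
                      "application/json"),
ok = riakc_pb_socket:put(Pid, Order),

%% At the same time, maintain the precomputed "orders for this user
%% today" object, so the later query is a single key/value read.
SummaryJson = <<"{\"orders\":[\"order-9001\"]}">>,
Summary = riakc_obj:new(<<"order_summaries">>, <<"user-42-2013-11-18">>,
                        SummaryJson, "application/json"),
ok = riakc_pb_socket:put(Pid, Summary).
```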

Because it is not always possible to know in advance what questions will need to be asked of your data, Riak offers added functionality on top of the key/value model. Tools such as Riak Search (a distributed, full-text search engine), Secondary Indexing (ability to tag objects with queryable metadata), and MapReduce (leveraged for aggregation tasks) are available to perform ad hoc queries as needed.

For many users, the tradeoffs of moving to Riak are worthwhile due to the overall benefits; however, it can be a bit of an adjustment. To see why others have chosen to switch to Riak from both relational systems and other NoSQL databases, check out our Users Page.

Basho

A Weekly Hangout With Basho

November 11, 2013

Last Friday, the Basho team held our inaugural Riak Community Hangout.

This 30-minute session is a development-focused conversation with topics changing weekly. The Hangout is planned for most Fridays at 11am Pacific/2pm Eastern/7pm GMT, with the URL published shortly before it begins. All Hangouts will be archived and hosted on the Basho Technologies YouTube channel. You should follow @basho for all updates about future Hangouts.

Over the next few weeks, these Hangouts will focus on the new features planned for Riak 2.0.

The first session was hosted by Basho’s Director of Community, Mark Phillips, who discussed Riak Data Types and Riak Search 2.0 with Basho engineers Sean Cribbs, Brett Hazen, and Luke Bakken.

The Hangout began with an overview of Riak Data Types, available with the 2.0 Technical Preview, and examined their use cases and implementation considerations. Following this (at 18 minutes, 35 seconds), Brett Hazen provided an overview of Riak Search 2.0 (codenamed Yokozuna) and Luke Bakken queried a portion of the Twitter stream on a cluster running the newest Riak Search 2.0 code.

Upcoming sessions will focus on Riak/Riak CS internals, application building, data modeling, and community-requested topics. We are also looking for community members to join in and highlight what you’re building with Riak and Riak CS.

If you have questions or topics you would like to hear discussed, reach out on the mailing list, in IRC (#riak on irc.freenode.net), or contact us.

Basho

Basho Announces Technical Preview of Riak 2.0 and Riak Enterprise 2.0

San Francisco, CA – October 29, 2013 – Today at RICON West, Basho, the worldwide leader in distributed systems and cloud storage software, announced that the Technical Preview of Riak 2.0, Basho’s distributed NoSQL database, is now publicly available. This major release introduces new features that improve developer ease-of-use, increase flexibility around consistency, boost search and analytics capabilities, simplify operations at scale, and provide enterprise-class data security.

Riak continues to gain adoption worldwide, supporting critical applications that require high availability, predictable scalability, and performance. Riak’s unique ability to distribute data, both to ensure availability and to provide data locality, gives enterprises a proven database technology for powering critical web, mobile, and social applications, cloud computing platforms, and systems that store and serve machine-to-machine and sensor data. Riak is used by thousands of companies, including over 30% of the Fortune 50.

New Features in Riak 2.0

  • Riak Data Types. Riak 2.0 includes a range of flexible, distributed data types that greatly simplify application development without sacrificing Riak’s availability and partition tolerance characteristics. Available Riak data types include distributed counters, sets, maps, registers, and flags.
  • Strong Consistency. Developers now have the flexibility to choose whether buckets should be eventually consistent (the default Riak configuration today that provides high availability) or strongly consistent, based on data requirements.
  • Full-Text Search Integration with Apache Solr. Riak Search is completely redesigned in Riak 2.0, leveraging the Apache Solr engine. Riak Search in 2.0 fully supports the Solr client query APIs, enabling integration with a wide range of existing software and commercial solutions.
  • Security. Riak 2.0 adds the ability to administer access rights and utilize plug-in authentication models. Authentication and authorization are provided via client APIs.
  • Simplified Configuration Management. Riak 2.0 continues to improve Riak’s operational simplicity by changing how, and where, configuration information is stored in an easy-to-parse and transparent format.
  • Reduced Replicas for Secondary Sites. Exclusive to Riak Enterprise 2.0, users can now optionally store fewer copies of replicated data across multiple datacenters to better maintain a balance between storage overhead and availability.

Technical Preview Availability

Download the Riak 2.0 Technical Preview here. All code for Riak 2.0 is also available on GitHub. For more details on the technical preview for Riak, visit our blog.

About RICON West
RICON West 2013 is part of the RICON conference series. RICON is Basho’s distributed systems conference for developers and academics. RICON West will take place in San Francisco, CA on October 29-30. More than 25 speakers will discuss applications, use cases, and the future of distributed systems – including NoSQL solutions and cloud storage. RICON West 2013 speakers include Basho, Google, Microsoft Research, Netflix, salesforce.com, Seagate, State Farm Insurance, The Weather Company, and Twitter. RICON West 2013 is sold-out; however, Basho will offer a live stream.

About Basho
Basho is a distributed systems company dedicated to making software that is highly available, fault-tolerant and easy-to-operate at scale. Basho’s distributed database, Riak, and Basho’s cloud storage software, Riak CS, are used by fast growing Web businesses and by over 25 percent of the Fortune 50 to power their critical Web, mobile and social applications and their public and private cloud platforms.

Riak and Riak CS are available open source. Riak Enterprise and Riak CS Enterprise offer enhanced multi-datacenter replication and 24×7 Basho support. For more information, visit basho.com. Basho is headquartered in Cambridge, Massachusetts and has offices in London, San Francisco, Tokyo and Washington DC.

Introducing Riak 2.0: Data Types, Strong Consistency, Full-Text Search, and Much More

October 29, 2013

Today at RICON West in San Francisco, we announced that the Technical Preview of Riak 2.0 is now available. This major release adds a number of new features that many of you have been waiting for.

Throughout RICON West, we will be discussing many of the Riak 2.0 features (both in track sessions and during lightning talks), so keep your eyes on the live stream over the next two days. Videos of all sessions will also be made available after the conference.

Here is a look at some of the major enhancements available in Riak 2.0:

  • Riak Data Types. Building on the eventually consistent counters introduced in Riak 1.4, Riak 2.0 adds sets and maps as new distributed data types. These Riak Data Types simplify application development without sacrificing Riak’s availability and partition tolerance characteristics.
  • Strong Consistency. Developers have the flexibility to choose whether buckets should be eventually consistent (the default Riak configuration today that provides high availability) or strongly consistent, based on data requirements.
  • Full-Text Search Integration with Apache Solr. Riak Search is completely redesigned in Riak 2.0, leveraging the Apache Solr engine. Riak Search in 2.0 supports the Solr client query APIs, enabling integration with a wide range of existing software and commercial solutions.
  • Security. Riak 2.0 adds the ability to administer access rights and utilize plug-in authentication models. Authentication and authorization are provided via client APIs.
  • Simplified Configuration Management. Riak 2.0 continues to improve Riak’s operational simplicity by changing how, and where, configuration information is stored in an easy-to-parse and transparent format.
  • Reduced Replicas for Multiple Data Centers. Riak Enterprise 2.0 can optionally store fewer copies of replicated data across multiple data centers to better maintain a balance between storage overhead and availability.

Ready to get started? Download the Technical Preview.

Please note that this is only a Technical Preview of Riak 2.0. This means that it has been tested extensively, as we do with all of our release candidates, but there is still work to be completed to ensure it’s production-hardened. Between now and the final release, we will be continuing manual and automated testing, creating detailed use cases, gathering performance statistics, and updating the documentation for both usage and deployment.

As we are finalizing Riak 2.0, we welcome your feedback on our Technical Preview. We are always available to discuss via the Riak Users mailing list or IRC (#riak on freenode), or you can contact us directly.

Riak 2.0 Technical Preview: Deep Dive

Riak Data Types
In distributed systems, we are forced to trade consistency for availability (see: CAP Theorem), and this can complicate some aspects of application design. In Riak 2.0, we have integrated cutting-edge research on data types known as CRDTs (Conflict-Free Replicated Data Types), pioneered by INRIA, to create Riak Data Types. By adding counters, sets, maps, registers, and flags, these Riak Data Types enable developers to spend less time thinking about the complexities of vector clocks and sibling resolution and, instead, focus on using familiar, distributed data types to support their applications’ data access patterns.

A more detailed overview of Riak Data Types is available that examines implementation considerations and the basics of usage.
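
As a small sketch of what this looks like from the Erlang client (riakc), assuming an operator has already created and activated a bucket type (here called counters) with the counter data type:

```erlang
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),

%% Increment a distributed counter; the servers merge concurrent
%% updates automatically, so no sibling resolution is needed.
ok = riakc_pb_socket:update_type(Pid,
        {<<"counters">>, <<"page_views">>}, <<"home">>,
        riakc_counter:to_op(riakc_counter:increment(1, riakc_counter:new()))),

%% Read back the converged value.
{ok, Counter} = riakc_pb_socket:fetch_type(Pid,
        {<<"counters">>, <<"page_views">>}, <<"home">>),
Views = riakc_counter:value(Counter).
```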

Strong Consistency
In all prior versions, Riak was classified as an eventually consistent system. With the 2.0 release, Riak now lets developers choose when operations should be strongly or eventually consistent, giving them a choice between these semantics for different types of data. At the same time, operators can continue to enjoy the operational simplicity of Riak. Consistency preferences are defined on a per-bucket-type basis, within the same cluster.

A RICON West 2012 talk entitled “Bringing Consistency to Riak” shares much of the initial thinking behind this effort. In addition, the pull request that adds consistency to riak_kv provides detailed information about related repositories and the implementation approach.

Redesigned Full-Text Search
Riak is a key/value store and the values are simply stored on disk as binaries. With previous versions of Riak Search, Riak developers have long been able to index the content of these stored values. In Riak 2.0, Riak Search (code-named Yokozuna) has been completely redesigned and now uses the Apache Solr full-text document indexing engine directly. Together, Riak and Solr provide a reliable full-text indexing solution that is highly available and built for scale. In addition, Riak Search 2.0 fully supports the Solr client query APIs, which enables integration with existing software solutions (either homegrown or commercial).

The Basho engineers responsible for Yokozuna have created a resources page that includes recorded talks, Solr documentation links, and books on the topic.
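
A minimal sketch of querying from the Erlang client (riakc); the index name and query string are illustrative, and the bucket still has to be associated with the index via its search_index property before documents are indexed:

```erlang
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),

%% Create a Solr index (then associate it with a bucket via the
%% bucket's search_index property).
ok = riakc_pb_socket:create_search_index(Pid, <<"tweets_idx">>),

%% Once documents are indexed, queries use Solr syntax directly.
{ok, Results} = riakc_pb_socket:search(Pid, <<"tweets_idx">>,
                                       <<"text:riak AND lang:en">>).
```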

Security
Basho designed Riak with critical data in mind. Whether it’s data that affects revenue, user experience, or even a patient’s health (as is the case with the NHS), Riak ensures that this critical data is always available. However, often this critical data is also sensitive data. Riak 2.0 adds security to this data through the ability to administer access rights and plug-in various secure authentication models commonly used today.

The initial RFC that describes the security effort, including related Pull Requests, is available at github.com/basho/riak/issues/355.

Simplified Configuration Management
At Basho, we pride ourselves on providing operationally friendly software that functions smoothly when dealing with the challenges of a distributed system. In the past, configuration of Riak occurred in two files: app.config and vm.args. Riak 2.0 changes how and where configuration information is stored. It no longer uses Erlang-specific syntax but, rather, provides a layout more suited for all operators and automated deployment tools. This layout is easy to parse and transparent for Riak administrators.

More information on the vision and specific implementation considerations are contained in the repository at github.com/basho/cuttlefish.

Bucket Types
In versions of Riak prior to 2.0, keys were made up of two parts: the bucket they belong to and a unique identifier within that bucket. Buckets act as a namespace and allow similar keys to be grouped. In addition, they provide a means of configuring how Riak treats that data.

In Riak 2.0, several new features (security and strong consistency in particular) need to interact with groups of buckets. To this end, Riak 2.0 includes the concept of a Bucket Type. Bucket Types allow these new features to work without requiring special prefixes in bucket names, and they let Riak developers and operators define a group of buckets that share the same properties, with Riak storing configuration once per Bucket Type rather than once per bucket.

More information about Bucket Types can be found in the GitHub issue at github.com/basho/riak/issues/362. This issue describes the planned functionality and implementation discussions, and includes related pull requests.
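
Once an operator has created and activated a Bucket Type, clients address data with a {Type, Bucket} pair. A minimal sketch with the Erlang client (riakc); the type and bucket names are hypothetical:

```erlang
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),

%% "sessions" buckets here live under a hypothetical "no_siblings"
%% type, created and activated by an operator beforehand.
TypedBucket = {<<"no_siblings">>, <<"sessions">>},
Obj = riakc_obj:new(TypedBucket, <<"session-12345">>,
                    <<"{\"user\":\"42\"}">>, "application/json"),
ok = riakc_pb_socket:put(Pid, Obj),
{ok, _Fetched} = riakc_pb_socket:get(Pid, TypedBucket, <<"session-12345">>).
```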

Change in Defaults for Sibling Resolution
Riak has always supported both application-side conflict resolution and server-side Last Write Wins resolution based on timestamps and vector clocks. Prior to Riak 2.0, vector clock-based Last Write Wins was the default. Moving forward, new clusters will hand off siblings to applications by default. This is the safest way to work with Riak, but it requires developers to be aware of sibling resolution.

In a blog series entitled “Understanding Riak’s Configurable Behaviours,” Basho Evangelist John Daily discusses the configuration of Last Write Wins, and many other options, in great detail.

More Efficient Use of Physical Memory
Riak nodes are designed to manage the changing demands of a cluster as it experiences network, hardware, and other failures. To do this, Riak balances each node’s resources accordingly. Riak 2.0 has vastly improved LevelDB’s use of available physical memory (RAM) by allowing local databases to dynamically change their cache sizes as the cluster fluctuates under load.

In the past, it was necessary to specify RAM allocation for different LevelDB caches independently. This is no longer the case. In Riak 2.0, LevelDB databases that manage key/value or active anti-entropy data share a single pool of memory, and administrators are free to allocate as much of the available RAM to LevelDB as they feel is appropriate in their deployment. Detailed implementation documentation can be found in the basho/leveldb wiki.

Riak Ruby Vagrant Project
If you are interested in testing Riak 2.0 in a contained environment with the Riak Ruby Client, Basho engineer Bryce Kerley has put together the Riak-Ruby-Vagrant repository. This environment can also be easily adapted for use with other clients to test the new features of Riak 2.0.

Basho

Building Retail and eCommerce Services with Riak

January 31, 2013

This is the second in a series of blog posts covering Riak for retail and eCommerce platforms. To learn more, join our “Retail on Riak” webcast on Friday, February 8th or download the “Riak for Retail” whitepaper.

In our last post, we looked at three Riak users in the eCommerce/retail space. In this post, we will look at some common use cases for Riak and how to start building them with Riak’s key/value model and querying features.

Use Cases

  • Shopping Carts: Riak’s focus on availability makes it attractive to retailers offering shopping carts and other “buy now” functionality. If the shopping cart is unavailable, loses product additions, or responds slowly to users, it has a direct impact on revenue and user trust.
  • Product Catalogs: Retailers need to store anywhere from thousands to tens of thousands of inventory items and associated information – such as photos, descriptions, prices, and category information. Riak’s flexible, fast storage makes it a good fit for this type of data.
  • User Information: As mobile, web, and multi-channel shopping become more social and personalized, retailers have to manage increasing amounts of user information. Riak scales efficiently to meet increased data and traffic needs and ensures user data is always available for online shopping experiences.
  • Session Data: Riak provides a highly reliable platform for session storage. User/session IDs are usually stored in cookies or otherwise known at lookup time, a good fit for Riak’s key/value model.

Data Modeling

In Riak, objects are composed of key/value pairs, which are stored in flat namespaces called “buckets.” Riak is content-type agnostic and stores all objects on disk as binaries, giving retailers lots of flexibility to store anything they want. Here are some common approaches to modeling the data and services discussed above in Riak:
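
For example, a shopping cart can be stored as a single JSON object keyed by the user’s session ID, so writes and lookups are each one key/value operation. A minimal sketch with the Erlang client (riakc); the names and JSON layout are illustrative:

```erlang
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),

%% One cart per session, stored as an opaque JSON document.
CartJson = <<"{\"items\":[{\"sku\":\"TSHIRT-M\",\"qty\":2}],\"total\":39.98}">>,
Cart = riakc_obj:new(<<"carts">>, <<"session-12345">>, CartJson,
                     "application/json"),
ok = riakc_pb_socket:put(Pid, Cart),

%% The session ID is already in the user's cookie, so reading the
%% cart back is a single GET.
{ok, _Cart} = riakc_pb_socket:get(Pid, <<"carts">>, <<"session-12345">>).
```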

Querying

Riak provides several features for querying data:

Riak Search: Riak Search is a distributed, full-text search engine. It supports various MIME types and analyzers, and provides robust querying.
Possible Use Cases: Searching product information or product descriptions.

Secondary Indexing: Secondary Indexing (2i) gives developers the ability, at write time, to tag an object stored in Riak with one or more values, queryable by exact matches or ranges of an index.
Possible Use Cases: Tagging products with categories, special promotion identifiers, date ranges, price or other metadata.
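
A sketch of that write-time tagging with the Erlang client (riakc); the index names and values are illustrative, and get_index_eq/4 assumes a reasonably recent client:

```erlang
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),

%% Tag a product with a category index at write time.
ProductJson = <<"{\"name\":\"Trail Shoe\",\"price\":49.99}">>,
Obj0 = riakc_obj:new(<<"products">>, <<"sku-1001">>, ProductJson,
                     "application/json"),
MD0  = riakc_obj:get_update_metadata(Obj0),
MD1  = riakc_obj:set_secondary_index(MD0,
           [{{binary_index, "category"}, [<<"shoes">>]}]),
ok = riakc_pb_socket:put(Pid, riakc_obj:update_metadata(Obj0, MD1)),

%% Later, query by exact match; Results carries the matching keys.
{ok, Results} = riakc_pb_socket:get_index_eq(Pid, <<"products">>,
                    {binary_index, "category"}, <<"shoes">>).
```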

MapReduce: Riak offers MapReduce for analytic and aggregation tasks with support for JavaScript and Erlang.
Possible Use Cases: Filtering product information by tag, counting items, and extracting links to related products.

Check out our docs for more information on building applications and services with Riak.

For more details and examples of common Riak use cases, register for our “Retail on Riak” webcast on February 8th or download the “Riak for Retail” whitepaper.

Basho

Slides – Searching and Accessing Data in Riak

September 24, 2012

Last week we aired a live broadcast from our sunny San Francisco offices on searching and accessing data in Riak. This is one of the top things we get questions about – especially from people moving from databases that use other data models or people building new applications on top of Riak.

Slides are below – unfortunately, we had some audio quality issues with the video and need to re-record it for upload. The webinar covers Riak’s full-text search, object tagging, and MapReduce capabilities, including:

  • use cases and features
  • when NOT to use
  • query examples
  • user scenarios
  • high-level architecture and configuration
  • hybrid architectures

Check out the slides below and make sure to read our docs on the subject.

SLIDES

In the talk, we mention a blog post from Riak user Clipboard on how they optimized Riak Search for their application, which is an awesome way to save and share online content. You can read that blog post here.

Protobuffs in Riak 1.2

July 18, 2012

You might remember that back in April, we sent around a survey to get input about what features developers use and want in Riak clients. All in all, we had about 87 developers respond to the survey.

One of the questions in that survey — and the one that was the most interesting to me — asked the respondent to rank some potential features for the roadmap. At the top of that list in the results was support for Secondary Index (2I) and Riak Search queries made natively from the Protocol Buffers (Protobuffs) interface. You could already query them by sending a MapReduce request, but the additional step was confusing for some, and slow for others. I set out to make these features happen for the Riak 1.2 release.

Coupling challenges

Originally, the Protobuffs interface was created in an effort to address a customer’s specific performance issue with HTTP, back around Riak version 0.10 or so. It seemed to work well for others, too, and so it got merged into the mainline. From that point until 1.0, not much was done with it. In Riak 1.0, it got a slew of new options — especially enhancements to Key-Value operations like get, put, and delete — that brought it closer to feature-parity with the HTTP interface.

Now, simply adding 2I queries to the existing system would have been straightforward, but search queries would not have been so. Why?

  • While the HTTP interface of Riak has always been built atop Webmachine, making it easy to add new resources as needed, the Protobuffs components were part of riak_kv. In fact, the Protobuffs interface was created while riak_search was still in its infancy, when we had little idea what its interface would look like. Adding a coupling back in the other direction (from riak_kv to riak_search) might just make the problem worse.
  • The riak-erlang-client was a dependency of riak_kv so that they could share the riakclient.proto file that contained all of the protocol message definitions. This made the Riak codebase potentially brittle to changes in the client library and made it necessary to copy the riakclient.proto file to our other clients that generate code from it.
  • We were using an antiquated version of the erlang_protobuffs library that we had forked and not kept up-to-date. The new maintainer had added features like extensions that we would like to use in the future. If I recall correctly, our version didn’t even properly support enumerations.

Refactoring

With those problems in mind and with the help of a few of my fellow Engineers, I set out to refactor the entire thing. Here’s what we came up with.

First, we separated the connection management from the message processing. This is a bit like how Webmachine works, where the accepting (mochiweb) and dispatching (webmachine) of an incoming HTTP message is separate from processing the message (your resource module and the decision graph). The result of our refactoring is the new riak_api OTP application. It consists of a TCP listener, server processes that are spawned for each connection, and a registration utility for defining your own message handlers, which are called “services”. Here’s how riak_kv registers its services:

```erlang
riak_api_pb_service:register([{riak_kv_pb_object,  3,  6},   %% ClientID stuff
                              {riak_kv_pb_object,  9, 14},   %% Object requests
                              {riak_kv_pb_bucket, 15, 22},   %% Bucket requests
                              {riak_kv_pb_mapred, 23, 24},   %% MapReduce requests
                              {riak_kv_pb_index,  25, 26}    %% Secondary index requests
                             ]).
```

Each service, represented as a module that implements the riak_api_pb_service behaviour, specifies a range of message codes it can handle. When an incoming message with a registered message code is received, it is dispatched to the corresponding service module, which can then do some processing and decide what messages to send back to the client.
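
To illustrate the shape of a service, here is a skeletal module. The callback set matches the behaviour described above, but the module itself, its codec (demo_pb), and its message handling are hypothetical:

```erlang
-module(demo_pb_service).
-behaviour(riak_api_pb_service).

-export([init/0, decode/2, encode/1, process/2, process_stream/3]).

%% Per-connection state; this service keeps none.
init() -> undefined.

%% Decode wire bytes for one of our registered message codes into an
%% Erlang term (demo_pb stands in for a generated codec module).
decode(Code, Bin) ->
    {ok, demo_pb:decode(Code, Bin)}.

%% Encode a response term back into wire format.
encode(Message) ->
    {ok, demo_pb:encode(Message)}.

%% Handle a decoded request and reply to the client.
process(Request, State) ->
    {reply, handle_request(Request), State}.

%% This service sends no streaming responses.
process_stream(_Message, _ReqId, State) ->
    {ignore, State}.

%% Hypothetical application logic.
handle_request(_Request) ->
    demo_response.
```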

Second, we separated the Protobuffs message definitions from the Erlang client library. We put the .proto file in a new library application called riak_pb, and actually split it out into several files, grouped by the component of the server they represent; this means there’s a riak.proto, riak_kv.proto, and riak_search.proto. In addition to removing the coupling between the Erlang client and the server, we now have a project whose only responsibility is to describe the messages of the protocol. It’s like the equivalent of an RFC, but in code! In the near future we will have build targets in the project that let us generate Java or Python shims from the included messages and that we can distribute as standalone .jar and .egg files.

Third, we merged upstream changes from the new erlang_protobuffs maintainer and made some updates of our own. In addition to the features like extensions, the newer version has a more complete test suite. Our own updates fixed some bugs and edge cases in the library so that we could improve the overall experience for users. For example, when encountering an unknown message field, the TCP connection will no longer close because of a decoding error; instead, the unknown field will just be ignored.

New features

Whew, that was a lot of work just to get to the good stuff! With the updated code structure and a plan for how to move forward, we added two new services, one in riak_kv (supporting native 2I) and one in riak_search (supporting native search-index queries), and four new messages to riak_pb to support those services. We decided not to expose the “add to index” or “delete from index” features in riak_search because we want to take it in a direction that focuses on indexing KV data rather than maintaining a separate index-management interface. If you’re already using the “search KV hook” to index your data, you’ll be fine.

Client-side support for these new requests and responses has already landed in the Ruby client and will soon be landing in Java, Erlang, and Python. You can track support for the new features on our updated Client Libraries wiki page.

Roadmap

Those two new client-facing features are great, but the survey showed us a lot more about what you want and need from Riak’s interfaces. For future releases, we’ll be investigating how to improve the Protobuffs interface’s error messages and support for bucket properties, how to expose bulk or asynchronous operations, and much more.

Keep using Riak and sending us great feedback!

Sean