Tag Archives: Riak

Praekelt Foundation Deploys Riak to Improve the Health and Well-Being of People Living in Poverty Throughout Africa

September 17, 2013

The Praekelt Foundation is a non-profit that builds open source, scalable mobile technologies and solutions to improve the health and well-being of people living in poverty. Their Vumi solution was created as a response to the rapid spread of mobile phones across Africa. Vumi allows for large scale mobile messaging using SMS and USSD, so no internet connectivity is required. Vumi uses Riak as a super reliable backend to store all the messages that are being processed and all responses. This data is all archived to allow for further analysis to see trends of areas and which campaigns are the most successful.

The Vumi Network reaches hundreds of thousands of end users across many countries. It works with non-governmental organizations to set up campaigns and services for emerging markets. These include education (Wikipedia uses Vumi to allow end-users to search and retrieve information from Wikipedia over SMS/USSD), health (partnering with Johnson & Johnson, the MAMA campaign (Mobile Alliance for Maternal Action) allows pregnant women to receive health information over SMS based pregnancy stage and HIV diagnosis), peaceful messaging (Sisi Ni Amani uses Vumi to prevent election violence in Kenya through grassroots engagement and tracking of early conflict warning signs), as well as many other utilities.

They had been using Postgres for years but, when it came to storing messages, they knew Postgres was only an interim solution. Since they needed a non-relational system, they started evaluating the key players in the NoSQL space. With MongoDB, they found the durability defaults needed for a performance boost were not adequate for their zero downtime needs; CouchDB did not give them the performance they needed; and Cassandra was too operationally intensive for their small team and Riak offered better features. When they began testing Riak, Praekelt Foundation Chief Engineer, Simon de Haan, was able to get a three-node cluster up and running on his laptop in 20 minutes. This operational simplicity, the reliability of the system, the ability to seamlessly scale to entire populations, and the range of query options made Riak a clear choice to power Vumi.

“It blew my mind how easy it was to set up Riak and was a huge selling point for our small operations team,” said de Haan. “We also needed a reliable system with solid up-time guarantees. Riak has never gone down on us and continues to survive individual restarts. The whole thing just works.”

Since launching Riak two years ago, they are running five nodes and push 1,000 messages each second. All messages are stored as JSON in Riak, which makes it easy for them utilize Secondary Indexing and MapReduce when querying this data. With the introduction of pagination with Secondary Indexes and Eventually Consistent Counters in Riak 1.4, they have also been able to move a lot of data from Redis over to Riak to take advantage of these new features. Additionally, The Praekelt Foundation will expand their querying capabilities later this year when Riak Search gets a makeover in the Riak 2.0 release.

The Praekelt Foundation is currently evaluating Riak and Riak CS for some of their other technologies and Basho is proud to be a part of such a great cause. For more information on the Praekelt Foundation and Vumi, visit their site at www.praekeltfoundation.org/


Superfeedr Deploys Riak for “The Cave”

September 12, 2013

Superfeedr provides a real-time API to any application that wants to produce (publishers) or consume (subscribers) data feeds without wasting resources or maintaining an expensive and changing infrastructure. It fetches and parses RSS or Atom feeds on behalf of its users and new entries are then pushed to subscribing applications using a webhook mechanism (PubSubHubbub) or XMPP. The Google Reader replacement is an example of a popular API built by Superfeedr that has backed up much of Google Reader.

Riak is used by Superfeedr to store the content from all feeds so users can retrieve past content (including the Google Reader API replacement), even if the feeds themselves may not include these entries anymore. This Riak datastore is referred to as “the cave.”

When Superfeedr first built “the cave” datastore, they opted for a cluster of large Redis instances (five servers with 8GB of memory each) due to its inherent speed. However, they realized that a more durable system was needed and the need to manually shard feeds across the cluster made it difficult to scale beyond storing a couple entries per feed. The scaling problem turned into an even larger issue because the average size of a stored entry was 2KB. Now, they had nearly 1,000 items per feed and 50 million feeds, translating to over 93TB of data and quickly growing.

They chose to move “the cave” to Riak due to its focus on availability (as delivering stale data was more important than delivering no data) and ease-of-scale. According to Superfeedr Founder, Julien Genestoux, “Riak solves the scalability problem elegantly. Through consistent hashing, our data is automatically distributed across the cluster and we can easily add nodes as needed.” While Riak does have a lower read performance than Redis, this proved to not be a problem as they found it easy to put caches in front of Riak if they needed to serve content faster.

Though Superfeedr found it easy to set up their Riak cluster, the default behavior for handling conflicts had to be adjusted for their use case. By working with Basho and the Riak community, they were able to find the right settings and optimize their conflict resolution algorithm. For more information on Riak’s configurable behaviors, check out our four-part blog series.

Superfeedr went into production in two phases: they started storing production data in the beginning of 2013 and began serving that data about two months later. During this period, Superfeedr was able to design their cluster infrastructure and thoroughly performance test it with actual production data.

Two types of objects are stored in Superfeedr’s Riak datastore: feeds and entries. Feeds are stored as a collection of internal feed ids, which correspond to the entries and include some meta-information, such as the title. Entries correspond to feed entries and are indexed by feedID-entryID, allowing them to store multiple entries for each feed. This indexing scheme allows entries to be retrieved, even if they lose track of the feed element, through a MapReduce job.

At write time, Superfeedr writes both the feed element and the entry element. When they query for a feed, they issue a MapReduce job to read both the feed element and the desired number of entry items. They also use a pagination mechanism to limit the resources consumed for each request, with an arbitrary limit of 50 entries.

Today, Superfeedr has served over 23 billion entries, with nearly one million more being published every hour. Their six-node Riak cluster (built on 16GB Linode slices) has allowed them to horizontally scale their cluster as their content and user base grows. “Riak is the right tool for us due to its scalability and always on availability,” said Genestoux. “We have refined it to fit our needs and can rest-assured that no data will ever be lost in our Riak ‘cave.’”

If you’re looking for a Google Reader replacement or interested in learning more about Superfeedr, check out their site: superfeedr.com/. For other examples of Riak in production, visit: basho.com/riak-users/


Rovio Uses Riak to Power Angry Birds Toons and New Mobile Games

September 10, 2013

Rovio, the creator of Angry Birds, has announced that they use Riak to manage their exponential data growth from their cartoon series, Angry Birds Toons, and their new mobile video games.

In March of 2013, Rovio’s games had been downloaded 1.7 billion times, with hundreds of millions of active users. Their upcoming mobile game and cartoon series were expected to draw in an even larger audience. However, since popularity can be hard to predict, Rovio needed an infrastructure that could support viral growth if needed, without failing and causing downtime. Similarly, if demand was lower than anticipated, they also needed the flexibility to rein back the infrastructure to avoid unnecessary expenditure. Riak’s ease-of-scale was the perfect solution to support this uncertainty and now the Rovio IT team is able to scale from tens of Riak servers to hundreds based on customer demand.

Riak was also selected due to its robust, low-latency features, which have ensured that customers receive the service levels they have come to expect. Riak replicates data across multiple nodes within the database, meaning even if nodes fails, the system maintains its high performance profile and never loses critical user data. Finally, Riak supports multiple data formats, which means consistent services are guaranteed regardless of the type of device a gamer is using.

Since implementing with Riak, internal development has become much more streamlined due to Riak’s operational simplicity. The new in-house user-interface named “Bigbird Library” was built on top of Riak and provides Rovio’s developers with a consistent and simple interface. This means that less time is spent grappling with complex IT systems, and more time can instead be focused on improving existing services and developing new, engaging content.

For more details about why Rovio chose Riak and why distributed systems such as Riak are the right solution for gaming companies, check out Timo Herttua’s (Rovio Entertainment Product Manager) talk from the Game Developers Conference.

For more information on building gaming services with Riak, check out the Gaming Spotlight page and download the Gaming on Riak Whitepaper.


Rovio Avoids a Birdquake; Deploys Riak to Underpin Angry Birds Toons

Open source distributed database provides resilient infrastructure and eases growing pains

London – 3rd September 2013Basho Technologies, an expert in distributed systems and cloud storage software, has today announced that Angry Birds creator, Rovio, has implemented Riak. The scalable, open source NoSQL database has enabled Rovio to economically and effectively manage growing data volumes resulting from its growing number of data operations resulting from its new cartoon series, Angry Birds Toons, and new mobile video games.

During December 2012, Rovio had 263 million monthly active users and by March 2013, its games had been downloaded 1.7 billion times. With global demand surging and the launch of Angry Birds Toons and new mobile games set to put further strain on the infrastructure, Rovio needed to ensure that its high service levels could be maintained in a cost effective way. To add further complexity, data transactions across multiple platforms, including smartphones and tablets, meant that investment was needed to keep the user experience consistent.

“Rovio started as a gaming company in 2003, but the success of Angry Birds meant we were facing an exponential increase in the amount of data we were dealing with. The game is now the number one paid app of all time and this popularity prompted us to release Angry Birds Toons. We had to find a way of dealing with the spikes in demand caused by our new releases as well as supporting the continued growth of Angry Birds,” said Juhani Honkala, Vice President of Technology at Rovio. “Basho’s distributed data store, Riak, allows us to deliver availability to our customers whilst maintaining a standard of service that we can pride ourselves on. These standards are crucial for us as gaming users will simply not continue to play if the interface is latent or unreliable.”

Angry Birds Toons and new mobile games were expected to draw in a large audience but as with any game or entertainment medium, popularity is hard to predict. Rovio needed an infrastructure that could support viral growth if needed without failing and causing downtime. Similarly, if demand was lower than anticipated, flexibility was needed to ensure that infrastructure could be reined back, avoiding unnecessary expenditure. The Riak solution addressed these goals and Rovio’s IT team is now able to scale from tens of Riak servers to hundreds, based on customer demand.

As well as being agile, Rovio has benefitted from Riak’s robust, low-latency features which have served to ensure customers receive the service levels they expect. Riak replicates data across multiple nodes within the database, providing a high tolerance for hardware failure without losing critical user data. This means that even if one node fails, the system maintains its high performance profile. Multiple data formats are also supported meaning consistent services are guaranteed regardless of the type of device a gamer is using.

Maintaining impeccable customer service has been a key benefit of this project and internal development has also become more streamlined since implementing Riak. A new in-house user-interface named ‘Bigbird library’ has been created on top of Riak, providing Rovio’s developers with a consistent and simple interface. This means that less time is spent grappling with complex IT systems, and more time can instead be focused on improving existing services and developing new, engaging creative.

“Providing the infrastructure for hundreds of millions of users is no small feat. The world is becoming much more connected, and people are using more devices than ever before. Keeping track of those data types and scaling to meet demand cost-effectively can be a huge challenge,” continued Juhani Honkala, Vice President of Technology at Rovio. “With Riak, Basho has provided us with the fast, scalable and flexible foundation needed to address the challenges associated with cross-platform entertainment. This has been done while keeping operational costs affordable and while providing the best possible experience to our global fan base.”

“Gaming providers increasingly face new challenges. Providing services at a global scale, with zero downtime, and while handling data in a number of different formats can be hugely complex. Get it wrong and the entertainment medium – whether that is a game or otherwise – will not be successful,” said Bobby Patrick, chief marketing officer at Basho. “Rovio is one of the most recognised gaming franchises in the world and the steps it has taken to support exceptional customer service will continue to draw in fans and make its products stand out from the crowd.”

About Basho Technologies
Basho is a distributed systems company dedicated to making software that is highly available, fault-tolerant and easy-to-operate at scale. Basho’s distributed database, Riak, and Basho’s cloud storage software, Riak CS, are used by fast growing Web businesses and by over 25 percent of the Fortune 50 to power their critical Web, mobile and social applications and their public and private cloud platforms.

Riak and Riak CS are available open source. Riak Enterprise and Riak CS Enterprise offer enhanced multi-datacenter replication and 24×7 Basho support. For more information, visit basho.com. Basho is headquartered in Cambridge, Massachusetts and has offices in London, San Francisco, Tokyo and Washington DC.

About Rovio Entertainment
Rovio Entertainment Ltd is an industry-changing entertainment media company headquartered in Finland, and the creator of the globally successful Angry Birds™ characters franchise. Angry Birds, a casual puzzle game, became an international phenomenon within a few months of its release. Angry Birds has expanded rapidly into multifaceted entertainment, publishing, and licensing to become a beloved international brand. Rovio’s animated Angry Birds feature film is slated for July 1, 2016. www.rovio.com

Media Contacts
Jeremy Hill, Basho, Jeremy@basho.com
Adele Connell, AxiCom, Adele.Connell@axicom.com

The Evolution of Basho's Chef Cookbook

September 9, 2013

Chef is a configuration management system that is widely deployed by Operations teams around the world. Tools like Chef can bring sanity and uniformity when deploying a massive Riak cluster; however, as with any tool, it needs to be reliably tested, as any misconfiguration could bring down systems. Here is the story of Chef and Riak.


The first public Chef Cookbook was pushed to Github on July 18, 2011, back when Riak 0.14.2 was the latest and greatest. We started by making the basic updates for releases but, as both the Riak and Chef user base grew, so did the number of issues and pull requests. Even with some automation, testing was still time consuming, error prone, and problematic. Too much time was being spent catching bugs manually, a familiar story for anyone who has had to test any software.

Our initial reaction was to only keep what we knew users (primarily customers) were using. As testing the build from source was so time consuming, it was removed until we could later ensure that it be properly tested. We knew that we had to start automating this testing pipeline to not only maintain quality, but sanity. Fortunately, Chef’s testing frameworks were beginning to come into their own and a free continuous integration service for GitHub repositories called TravisCI was starting to take off. However, before talking about the testing frameworks, we need to cover two tools that help make this robust testing possible.


Vagrant is a tool that leverages virtualization and cloud providers, so users don’t have to maintain custom static virtual machines. Vagrant was on the rise when we started the cookbooks and was indispensable for early testing. While it didn’t offer us a completely automated solution, it was far ahead of anything else at the time and serves as a great building block for our testing today.

There are also a variety of useful plugins that we use in conjunction with it, including vagrant-berkshelf and vagrant-omnibus. The former integrates Vagrant and Berkshelf so each Vagrant “box” has its own self-contained set of cookbooks that it uses and the latter allows for easy testing of any version of Chef.


Berkshelf manages dependencies for Chef cookbooks – like a Bundler Gemfile for cookbooks. It allows users to identify and pin a known good version, bypassing the potential headaches of trying to keep multiple cookbooks in sync.

Now, back to testing frameworks.

Enter Foodcritic

Foodcritic is a Ruby gem used to lint cookbooks. This not only checks for basic syntax errors that would prevent a recipe from converging, but also for style inconsistencies and best practices. Foodcritic has a set of rules that it checks cookbooks against. While most of them are highly recommended, there may be a few that don’t apply to all cookbooks and can be ignored on execution. Combine this with TravisCI, and each commit or pull request to GitHub is automatically tested.

While this is helpful, it still didn’t actually help us test that the cookbooks worked. Luckily, we weren’t the only ones with this issue, which is why Fletcher Nichol wrote test-kitchen.


Test-kitchen is another Ruby gem that helps to integrate test cookbooks using a variety of drivers (we use the Vagrant driver). For products like Riak and Riak CS, there are a number of supported platforms that we need to run the cookbook against, and that’s exactly what this tool accomplishes.

In the configuration file for test-kitchen, we define the permutation of Vagrant box, run list, and attributes for testing as many cases for the cookbook as needed. With this, we are able to execute simple Minitests against multiple platforms and we can also test both our enterprise and open source builds at any version by configuring attributes appropriately.

Granted, if you need to spin up a ton of virtual machines in parallel, you’ll want a beefy machine, but the upside is that you’ll have a nice status report to know which permutation of platform/build/version failed.

Why is This Important?

With these tools, we are able to make sure our builds pass all tests across platforms. Since we have many customers deploying the latest Riak and Riak CS version with Chef, we need to ensure that everything works as expected. These tools allowed us to move from testing every cookbook change manually to automatically testing the permutations of operating system, Chef, and Riak versions.

Now everyone gets a higher quality cookbook and there are fewer surprises for those maintaining it. Testing has shifted from a chore to a breeze. This benefits not only our users, but ourselves included as these cookbooks are used to maintain our Riak CS demo service.

Check out our Docs Site for more information about installing Riak with Chef.

Special thanks to Joshua Timberman (advice), Fletcher Nichol (test-kitchen), Hector Castro (reviews and PRs), Mitchell Hashimoto (Vagrant), and Jamie Winsor (Berkshelf).

Seth Thomas

Mobile Taxi Platform Flywheel Uses Riak to Power Their Application

September 5, 2013

Flywheel is a mobile taxi hailing platform that is in more than half of the cabs in San Francisco and has recently expanded to LA. Through the mobile application, users can request cabs, view their location, and pay via their smartphone. Flywheel is currently using Riak for their underlying passenger and driver engine. This engine stores information such as user accounts, passenger information, taxi information, and ride details. Riak also stores real-time production data, such as passenger ride requests and ride accepts from drivers.

Part of the future growth strategy for Flywheel was shifting to a purely Linux and open source based infrastructure. This meant moving away from a more traditional closed source relational database system. They needed something that was easy to get up and running and that didn’t require a lot of developer resources to manage. Flywheel evaluated a number of different open source choices, including Redis, MongoDB, Cassandra, and CouchDB. Ultimately, they decided to move to Riak and supplement it with Postgres, as Riak offered the most operational simplicity.

Flywheel went into production with Riak in September of 2012. They are currently running eight nodes in their cluster and handle 25,000-30,000 writes and 50,000-60,000 reads each day. Riak’s key/value data model has been a natural fit for the application’s “events” that happen each time a taxi ride is processed. These events include taxi hails, driver response, taxi rides, ride payments, etc. and buckets are used to group them together within Riak.

For more information about Flywheel, check out their site or download their app. To learn more about Riak, visit basho.com/riak/.


Mobile Taxi Hailing Application, Flywheel, Uses Basho’s Riak to Ensure App Availability During Peak Events

SAN FRANCISCO – SEPTEMBER 5, 2013Flywheel, a mobile taxi hailing platform, works with Basho’s distributed NoSQL database, Riak, to power their passenger and driver engine. During Bay to Breakers 2013, San Francisco’s popular marathon, Flywheel saw their highest traffic spikes due to the hundreds of thousands of people trying to get around the city. While their application was experiencing peak loads, Riak ensured that this concurrent usage had no impact on the end-user experience and seamlessly handled this traffic at low-latency with zero downtime.

Flywheel is in more than half of the cabs in San Francisco and has recently expanded to the LA area. Due to their quick growth and need for operational simplicity, Flywheel decided to move their platform to Riak in September of 2012. Riak provided the scale and ease-of use necessary for Flywheel’s small team and beat out many alternative NoSQL databases.

“Bay to Breakers was an important time for us to solidify our place in the mobile taxi market,” said Cuyler Jones, Chief Architect at Flywheel. “Part of why we moved to Riak was to leverage its high availability and scalability, which it achieved perfectly. It was great to have one less thing to worry about during this key event.”

Flywheel’s passenger and driver engine stores information such as user accounts, passenger information, taxi information, and ride details. In addition, Riak is also used to store real-time production data, such as passenger ride requests and ride accepts from drivers.

Basho’s Riak is an open source distributed database is designed for always-on availability, fault-tolerance, scalability, and ease-of-use. It is used by companies worldwide that need to always store and access critical data. Mobile is one of the most common use cases for Riak, due to the high availability and low-latency requirements, as well as the need to scale quickly to meet peak loads. For a look at how others use Riak to solve the challenges of mobile applications and services, visit the Mobile Spotlight page.

Flywheel is currently running eight nodes in their Riak cluster and handle 25,000-30,000 writes and 50,000-60,000 reads each day. For more information about Flywheel, check out their site and download their app. To learn more about Riak, visit basho.com/riak/.

About Basho
Basho is a distributed systems company dedicated to making software that is highly available, fault-tolerant and easy-to-operate at scale. Basho’s distributed database, Riak and Basho’s cloud storage software, Riak CS, are used by fast growing Web businesses and by over 25 percent of the Fortune 50 to power their critical Web, mobile and social applications and their public and private cloud platforms.

Riak and Riak CS are available open source. Riak Enterprise and Riak CS Enterprise offer enhanced multi-datacenter replication and 24×7 Basho support. For more information, visit basho.com. Basho is headquartered in Cambridge, Massachusetts and has offices in London, San Francisco, Tokyo and Washington DC.

About Flywheel
With offices in San Francisco and Redwood City, Flywheel Software, Inc was founded in 2010 to provide an all-new experience to both passengers and drivers of for-hire vehicles. The Flywheel service includes a mobile app whereby its customers order taxi rides in real-time, track arrival via GPS and automatically pay their fare–all from their smart phone device.

Flywheel’s Investors include Craton Equity Partners of Los Angeles, Shasta Ventures, RockPort Capital, Sand Hill Angels and members of the Band of Angels. Flywheel can be found at www.flywheel.com

Indexing the Zombie Apocalypse With Riak

September 4, 2013

For more background on the indexing techniques described, check out our blog “Index for Fun and for Profit

Check out the Term-Based Inverted Zombie Index App here: zombies.samples.basho.com

The War Against Zombies is Still Raging!

In the United States, the CDC has recovered 1 million Acute Zombilepsy victims and has asked for our help loading the data into a Riak cluster for analysis and ground team support.

Know the Zombies, Know Thyself

The future of the world rests in a CSV file with the following fields:

  1. DNA
  2. Gender
  3. Full Name
  4. StreetAddress
  5. City
  6. State
  7. Zip Code
  8. TelephoneNumber
  9. Birthday
  10. National ID
  11. Occupation
  12. BloodType
  13. Pounds
  14. Feet Inches
  15. Latitude
  16. Longitude

For each record, we’ll serialize this CSV document into JSON and use the National ID as the Key. Our ground teams need the ability to find concentrations of recovered zombie victims using a map so we’ll be using the Zip Code as an index value for quick lookup. Additionally, we want to enable a geospatial lookup for zombies so we’ll also GeoHash the latitude and longitude, truncate the hash to four characters for approximate area lookup, and use that as an index term. We’ll use the G-Set Term-Based Inverted Indexes that we created since the dataset will be exclusively for read operations once the dataset has been loaded. We’ve hosted this project at Github so that, in the event we’re over taken by zombies, our work can continue.

In our load script, we read the text file and create new zombies, add Indexes, then store the record:

load_data.rb script

Our Zombie model contains the code for serialization and adding the indexes to the object:

zombie.rb add index

Let’s run some quick tests against the Riak HTTP interface to verify that zombie data exists.

First let’s query for a known zombilepsy victim:

curl -v

Next, let’s query the inverted index that we created. If the index has not been merged, then a list of siblings will be displayed:

Zip Code for Jackson, MS:
curl -v -H "Accept: multipart/mixed"

GeoHash for Washington DC:
curl -v -H "Accept: multipart/mixed"

Excellent. Now we just have to get this information in the hands of our field team. We’ve created a basic application which will allow our user to search by Zip Code or by clicking on the map. When the user clicks on the map, the server converts the latitude/longitude pair into a GeoHash and uses that to query the inverted index.

Colocation and Riak MDC will Zombie-Proof your application

First we’ll create small Sinatra application with the two endpoints required to search for zip code and latitude/longitude:

server.rb endpoints

Our zombie model does the work to retrieve the indexes and build the result set:

zombie.rb search index

Saving the world, one UI at a time

Everything is wired up with a basic HTML and JavaScript application:


Searching for zombies in the Zip Code 39201 yields the following:


Clicking on Downtown New York confirms your fears and suspicions:


The geographic bounding inherent to GeoHashes is obvious in a point-dense area so, in this case, it would be best to query the adjacent GeoHashes.

Keep Fighting the Good Fight!

There is plenty left to do in our battle against zombies! Check out the complete app here: zombies.samples.basho.com

  • Create a Zombie Sighting Report System so the concentration of live zombies in an area can quickly be determined based on the count and last report date.
  • Add a crowdsourced Inanimate Zombie Reporting System so that members of the non-zombie population can report inanimate zombies. Incorporate Baysian filtering to prevent false reporting by zombies. They kind of just mash on the keyboard so this shouldn’t be too difficult.
  • Add a correlation feature, utilizing Graph CRDTs, so we can find our way back to Patient Zero.

Dan Kerrigan & Drew Kerrigan

Customer.io Gains 6x Speed Improvement by Moving from MongoDB to Riak

August 28, 2013

Customer.io is passionate about helping their customers grow happy customers. Their focus is on creating genuine, relevant interactions for their customers. Of course, happy customers expect great performance. As Customer.io continues to rapidly grow, they are putting in place the foundation to deliver on those commitments.

Yesterday, Customer.io announced that they upgraded their architecture – moving from MongoDB to Riak. As described in their recent blog post, the move to Riak has provided an immediate and dramatic performance boost. Some performance highlights include:

  • User segmentation can run anywhere from 6x faster (raw performance) to 100x faster, taking into account that customer requests are now parallelizable. (To send more relevant, timely emails, Customer.io enables subsets of people to be grouped around similar characteristics.)
  • Processing time was reduced from 3 hrs to 30 minutes on a large segment.
  • Customer.io launched a new feature that shows percentage complete.

In addition from gaining the inherent benefits from Riak as a scalable, distributed system, Customer.io also implemented Go, an increasingly popular programming language. Go adds powerful message queuing, systems programming, and exceptional concurrency.

You can view the entire blog post from Customer.io here: customer.io/blog/Segment-customer-data-faster.html


Index for Fun and for Profit

August 28, 2013

What is an Index?

In Riak, the fastest way to access your data is by its key.

However, it’s often useful to be able to locate objects by some other value, such as a named collection of users. Let’s say that we have a user object stored under its username as the key (e.g., thevegan3000) and that this particular user is in the Administrators group. If you wanted to be able to find all users, such as thevegan3000 who are in the Administrators group, then you would add an index (let’s say, user_group) and set it to administrator for those users. Riak has a super-easy-to-use option called Secondary Indexes that allows you to do exactly this and it’s available when you use either the LevelDB or Memory backends.

Using Secondary Indexes

Secondary Indexes are available in the Riak APIs and all of the official Riak clients. Note that user_group becomes user_group_bin when accessing the API because we’re storing a binary value (in most cases, a string).

Add and retrieve an index in the Ruby Client:

In the Python Client:

In the Java Client:

More Example Use Cases

Not only are indexes easy to use, they’re extremely useful:

  • Reference all orders belonging to a customer
  • Save the users who liked something or the things that a user liked
  • Tag content in a Content Management System (CMS)
  • Store a GeoHash of a specific length for fast geographic lookup/filtering without expensive Geospatial operations
  • Time-series data where all observations collected within a time-frame are referenced in a particular index

What If I Can’t Use Secondary Indexes?

Indexing is great, but if you want to use the Bitcask backend or if Secondary Indexes aren’t performant enough, there are alternatives.

A G-Set Term-Based Inverted Index has the following benefits over a Secondary Index:

  • Better read performance at the sacrifice of some write performance
  • Less resource intensive for the Riak cluster
  • Excellent resistance to cluster partition since CRDTs have defined sibling merge behavior
  • Can be implemented on any Riak backend including Bitcask, Memory, and of course LevelDB
  • Tunable via read and write parameters to improve performance
  • Ideal when the exact index term is known

Implementation of a G-Set Term-Based Inverted Index

A G-Set CRDT (Grow Only Set Convergent/Commutative Replicated Data Type) is a thin abstraction on the Set data type (available in most language standard libraries). It has a defined method for merging conflicting values (i.e. Riak siblings), namely a union of the two underlying Sets. In Riak, the G-Set becomes the value that we store in our Riak cluster in a bucket, and it holds a collection of keys to the objects we’re indexing (such as thevegan3000). The key that references this G-Set is the term that we’re indexing, administrator. The bucket containing the serialized G-Sets accepts Riak siblings (potentially conflicting values) which are resolved when the index is read. Resolving the indexes involves merging the sibling G-Sets which means that keys cannot be removed from this index, hence the name: “Grow Only”.

administrator G-Set Values prior to merging, represented by sibling values in Riak


administrator G-Set Value post merge, represented by a resolved value in Riak


Great! Show me the code!

As a demonstration, we integrated this logic into a branch of the Riak Ruby Client. As mentioned before, since a G-Set is actually a very simple construct and Riak siblings are perfect to support the convergent properties of CRDTs, the implementation of a G-Set Term-Based Inverted Index is nearly trivial.

There’s a basic interface that belongs to a Grow Only Set in addition to some basic JSON serialization facilities (not shown):

gset.rb interface

Next there’s the actual implementation of the Inverted Index. The index put operation simply creates a serialized G-Set with the single index value into Riak, likely creating a sibling in the process.

inverted_index.rb put index term

The index get operation retrieves the index value. If there are siblings, it resolves them by merging the underlying G-Sets, as described above, and writes the resolved record back into Riak.

inverted_index.rb get index term

With the modified Ruby client, adding a Term-Based Inverted Index is just as easy as a Secondary Index. Instead of using _bin to indicate a string index and we’ll use _inv for our Term-Based Inverted Index.

Binary Secondary Index: zombie.indexes['zip_bin'] << data['ZipCode']

Term-Based Inverted Index: zombie.indexes['zip_inv'] << data['ZipCode']

The downsides of G-Set Term-Based Inverted Indexes versus Secondary Indexes

  • There is no way to remove keys from an index
  • Storing a key/value pair with a Riak Secondary index takes about half the time as putting an object with a G-Set Term-Based Inverted Index because the G-Set index involves an additional Riak put operation for each index being added
  • The Riak object which the index refers to has no knowledge of which indexes have been applied to it
    • It is possible; however, to update the metadata for the Riak object when adding its key to the G-Set
  • There is no option for searching on a range of values (e.g., all user_group values from administrators to managers)

See the Secondary Index documentation for more details.

The downsides of G-Set Term-Based Inverted Indexes versus Riak Search:

Riak Search is an alternative mechanism for searching for content when you don’t know which keys you want.

  • No advanced searching: wildcards, boolean queries, range queries, grouping, etc

See the Riak Search documentation for more details.

Let’s see some graphs.

The graph below shows the average time to put an object with a single index and to retrieve a random index from the body of indexes that have already been written. The times include the client-side merging of index object siblings. It’s clear that although the put times for an object + G-Set Term-Based Inverted Index are roughly double than that of an object with a Secondary Index, the index retrieval times are less than half. This suggests that secondary indexes would be better for write-heavy loads but the G-Set Term-Based Inverted Indexes are much better where the ratio of reads is greater than the number of writes.


Over the length of the test, it is even clearer that G-Set Term-Based Inverted Indexes offer higher performance than Secondary Indexes when the workload of Riak skews toward reads. The use of G-Set Term-Based Inverted Indexes is very compelling even when you consider that the index merging is happening on the client-side and could be moved to the server for greater performance.


Next Steps

  • Implement other CRDT Sets that support deletion
  • Implement G-Set Term-Based Indexes as a Riak Core application so merges can run alongside the Riak cluster
  • Implement strategies for handling large indexes such as term partitioning

Dan Kerrigan