Tag Archives: open source

Basho Announces Open Source Riak CS and General Availability of Riak CS Enterprise v1.3

New York City, NY. – March 20, 2013 – Today at GigaOM Structure Data 2013 in New York City, Basho, the worldwide leader in distributed database and cloud storage software, announced that Riak CS (Cloud Storage) is now available open source, significantly expanding the ease-of-access to Basho’s software for developers, enterprise architects, and IT operations professionals seeking to build public or private storage clouds. Also today, Basho announced the general availability of Riak CS v1.3, the third release of Basho’s simple, available cloud storage software.

Riak CS is a multi-tenant, distributed, S3-compatible cloud storage platform that enables enterprises and service providers to launch public or private cloud services. Built on top of Riak, the world’s most advanced, open source, distributed database, Riak CS provides horizontal scale, extreme durability and low operational overhead in a distributed object storage system. Riak CS Enterprise adds Basho’s multi-datacenter replication technology and is backed by Basho’s 24×7 support and enterprise-class service-level commitments.

Riak CS Enterprise is used by great organizations worldwide including Datapipe, Deutsche Vermögensberatung (DVAG), IDC Frontier, Rovio, and Yahoo! JAPAN.

New Features in 1.3

  • Multipart Upload. Riak CS v1.3 includes a new multipart upload capability that lets users store very large files by uploading parts in parallel.
  • Enhanced Control for Multi-Tenant Environments. Riak CS v1.3 introduces object access control by source IP enabling operating to restrict access to Riak buckets by IP address.
  • Support for GET Range Requests. Riak CS users can now retrieve a range of bytes from a single object. This functionality is implemented in the “Range” request header of GET operations.
  • Graphical Tool for Riak CS. Riak CS Control is a standalone web administration tool for user management.

Product Availability

Basho Riak CS is now available for download. For more information, learn more on the Basho blog or visit our downloads page: Riak CS Download Site

Basho offers a hosted “sandbox” to test interfacing with a live implementation of Riak CS. The “sandbox” is available at https://www.riakcs.net/users/sign_in.

For more information on Riak CS Enterprise, and to request a Developers Trial License, click https://basho.com/riak-cloud-storage/.

Upcoming Riak CS Webinar and RICON EAST

Basho will host an “Introduction to Riak CS Webinar” on Tuesday, April 2. To participate in the webinar, sign up here.

Basho is hosting RICON EAST on May 13 – 14, 2013 in New York City, NY. RICON is Basho’s distributed systems conference by and for engineers, developers, scientists and architects. For ticket information on RICON East, visit http://ricon.io/east.html.


Greg Collins, president and CEO, Basho
“It has been almost one year since we first released Riak CS. In just 12 months, we have seen rapid adoption by global cloud operators, telecommunication providers and large enterprises. Over the past year, Riak CS has gained new advanced capabilities and has been battle-tested in many of our customers’ and partners’ labs. Our customers have deployed Riak CS as the object storage engine inside popular cloud computing platforms, including Apache CloudStack and OpenStack. Today, by open sourcing Riak CS, we are making it easier for users to experiment with and test Riak CS, to provide rapid product feedback, and to contribute to its future capabilities.”

Ash Yamanaka, general manager, IDC Frontier and
Shingo Saito, cloud product manager, Yahoo! JAPAN

“Basho, Yahoo! JAPAN and IDC Frontier, a member of Yahoo! JAPAN group, have a very strong and growing partnership. Today, Yahoo! JAPAN and IDC Frontier leverage Riak CS Enterprise to offer an S3-compatible public cloud storage service, as well as dedicated hosting options for our customers various applications. Yahoo! JAPAN and IDC Frontier are highly supportive of open source software and we view Basho’s announcement today as a positive move that will work to accelerate its ability to innovate and ultimately strengthen our cloud platform.”

Sameer Dholakia, group vice president and GM, Citrix Platforms Group, Citrix
“Basho clearly understands the market power of open source. Since Citrix and Basho started collaborating last year, we have seen strong enthusiasm among Citrix CloudPlatform users for Basho’s cloud object storage solution. Now, CloudStack users have easy access to Riak CS and can quickly deploy an object storage solution that features multi-tenancy and S3 compatibility. We believe that many Citrix CloudPlatform customers will also seek Riak CS Enterprise for its distributed data capabilities across multiple data centers.”

Ed Laczynski, vice president, Cloud Strategy and Architecture, Datapipe
“Datapipe is very supportive of Basho’s decision to open source portions of Riak CS. During the last six months, we have deployed Riak CS Enterprise in Datapipe’s 10gig Stratosphere cloud computing platform. Riak CS provides Datapipe and its customers with highly available, low-latency and S3-compatible storage. Datapipe’s customers will benefit as Basho’s community increasingly experiments, tests and contributes to Riak CS, ultimately speeding our access to more capabilities and higher performance.”

Simon Robinson, vice president of storage research, 451 Research
“The cloud storage market continues to accelerate as companies seek to build public and private storage clouds that mirror Amazon Web Services’ capabilities and economics. Basho, with Riak CS, already has a proven track record of successful customer public and private cloud deployments. Now, Basho is demonstrating it has confidence that the technical and business benefits of Risk CS can be accelerated even faster via the open source model.”

About Basho Technologies

Basho is a distributed systems company dedicated to making software that is always available, fault-tolerant and easy-to-operate at scale. Basho’s distributed NoSQL database, Riak, and Basho’s cloud storage software, Riak CS, are used by fast growing Web businesses and by over 25% of the Fortune 50 to power their critical Web, mobile and social applications and their public and private cloud platforms.

Riak and Riak CS are available open source. Riak Enterprise and Riak CS Enterprise offer enhanced multi-datacenter replication and 24×7 Basho support. For more information, visit basho.com.

Basho is headquartered in Cambridge, Massachusetts and has offices in London, San Francisco, Tokyo and Washington DC.

Media Contact
Alex Gutow
Basho Marketing Manager

Riak CS is Now Open Source

March 20, 2013

Riak CS (Cloud Storage) is simple, available cloud storage software built on Riak. Basho announced today that Riak CS is now open source under the Apache 2 license. Organizations and users can now access the source code on Github and download the latest packages from the downloads page. Also, today, we announced that Riak CS Enterprise is now available as commercial licensed software, featuring multi-datacenter replication technology and 24×7 Basho customer support.

We will be hosting an introductory webcast to Riak CS on Tuesday, April 2. Sign up here.

Riak CS can be used to build private or public clouds or as reliable, available storage behind applications and platforms. Riak CS Enterprise is currently used by large corporations including Datapipe, Deutsche Vermögensberatung (DVAG), IDC Frontier, Rovio, and Yahoo! JAPAN.

Basho is a distributed systems company dedicated to making software that is available, fault-tolerant, and easy to operate at scale. Twenty-five percent of the Fortune 50 and thousands of open source users large and small run our flagship open source database, Riak. Riak CS takes distributed systems principles derived from production Riak users and applies it to the problem of large scale storage. We are excited to share this code with the world.

Riak CS features:

  • Highly available, fault-tolerant storage
  • Large object support
  • S3-compatible API and authentication
  • Multi-tenancy and per-user reporting
  • Simple operational model for adding capacity
  • Robust stats for monitoring and metrics

For users requiring multi-datacenter replication and enterprise-level support, Riak CS Enterprise (a commercial extension of Riak CS) is available.

New Features

Today we are also announcing several new features, available now as part of the open source edition.

  • Multipart upload. Upload very large files to Riak CS as a series of parts. Parts can be between 5MB and 5GB.
  • Support for GET range queries. Retrieve a range of bytes from a single object. This functionality is implemented in the “Range” request header of GET operations.
  • Per-bucket policies to restrict access to buckets based on source IP.
  • Riak CS Control. Riak CS Control is a standalone web administration tool for user management available on Github.

Supporting Quotes

“Basho, Yahoo! JAPAN, and IDC Frontier a member of Yahoo! JAPAN group have a very strong and growing partnership. Today, Yahoo! JAPAN and IDC Frontier leverage Riak CS Enterprise to offer an S3-compatible public cloud storage service, as well as dedicated hosting options for our customers various applications. Yahoo! JAPAN and IDC Frontier are highly supportive of open source software and we view Basho’s announcement today as a positive move that will work to accelerate its ability to innovate and ultimately strengthen our cloud platform.”
- Ash Yamanaka, general manager, IDC Frontier and
– Shingo Saito, cloud product manager, Yahoo! JAPAN

“Basho clearly understands the market power of open source. Since Citrix and Basho started collaborating last year, we have seen strong enthusiasm among Citrix CloudPlatform users for Basho’s cloud object storage solution. It has also provided the Apache CloudStack community with easy access to Riak CS for multi-tenancy and S3 compatibility. With today’s announcement, Citrix CloudPlatform customers will continue to benefit from Riak CS Enterprise for its distributed data capabilities across multiple data centers.”
- Sameer Dholakia, group vice president and GM, Citrix Platforms Group, Citrix

“Over the last six months, we have deployed Riak CS Enterprise within Datapipe’s 10gig Stratosphere cloud computing platform. Riak CS provides our customers with highly available, low-latency, S3-compatible cloud object storage. Datapipe is very supportive of Basho’s decision to open source portions of Riak CS. As Basho’s open source community grows, experiments, tests and contributes to Riak CS, Datapipe clients will benefit from access to additional capabilities and higher performance.”
- Ed Laczynski, vice president, Cloud Strategy and Architecture, Datapipe


Please join us for an introductory technical webcast to Riak CS on April 2. You can also read a technical overview on our website and find full documentation here.

In the coming weeks and months, we look forward to helping new users get started with Riak CS and be successful running it in production. We’ll be expanding integration and partnerships with open source cloud computing platforms in order to provide integrated storage and compute to the marketplace. As always, we’ll be listening to feedback, engaging with the community, and accepting pull requests.


Where To Start With Riak Core

April 12, 2011

There has been a lot of buzz as of late around “riak_core” in various venues, so much so that we are having trouble producing enough resources and content to keep the community at bay (though we most-certainly have plans to). While we hustle to catch up, here is the rundown on what is currently available for those of you who want to learn about, look at, and play with riak_core.

(TL;DR – riak_core is the distributed systems framework that underpins Riak and is, in our opinion, what makes Riak the best and most-robust distributed datastore available today. If you want so see it in action, go download Riak and put it through its paces.)


If you know nothing about riak_core (or are in the mood for a refresher), start with the Introducing Riak Core blog post that appeared on the Basho Blog a while back. This will introduce you, at a very high-level, to what riak_core is and how it works.

Slides and Videos

There are varying degrees of overlap in each of these slides and videos, but they all address riak_core primarily.


  • riak_core repo on GitHub
  • Basho Banjo – Sample application that uses Riak Core to play distributed music
  • Try Try Try – Ryan Zezeski’s working blog that is taking an in depth look at various aspects of riak_core
  • rebar_riak_core – Rebar templates for riak_core apps from the awesome team at Webster/Clay

Getting Involved With Riak and Riak Core

We are very much as the beginning of what Riak Core can be as a stand alone platform for distributed applications, so if you want to get in at the ground floor of something that we feel is truly innovative and unparalleled, now is the time. The best way to join the conversation and to help with the development of Riak Core is to join the Riak Mailing list where you can start asking questions and sharing code.

If you want to see riak_core in action, look no further than Riak, Riak Search, and Luwak. The distribution and scaling components for all of these projects if handled by riak_core.

Also, make sure to follow the Basho Team on Twitter as we spend way too much time talking about this stuff.


Free Software Can Not Be Taken Away

November 15, 2010

Oracle didn’t (and can’t) take away your open source software.

A few weeks ago Oracle caused a lot of confusion when they changed the makeup of the MySQL product line, including a “MySQL Classic Edition” version that does not cost money and does not include InnoDB. That combination in the product chart made many people wonder if InnoDB itself had ceased to be free in either the “free beer” or “free speech” sense. The people wondering and worrying included a few users of Innostore, the InnoDB-based storage engine that can be used with Riak.

Luckily, open source software doesn’t work that way.

Oracle didn’t really even try to do what some people thought; they just released a confusing product graph which they have since updated. The MySQL that most people think of first is MySQL Community Edition and it was not one of the editions mentioned in the chart that confused people. That version of MySQL, as well as all of the GPL components included in it such as InnoDB, remain free of cost and also available under the GPL.

This confusion eventually led to a public response from Oracle, so you can read it authoritatively if you like.

Even if someone wanted to, they couldn’t “take it back” in the way that some people feared. Existing software that has been legitimately available under an open source license such as GPL or Apache cannot retroactively be made unfree. The copyright owner might choose to not license future improvements as open source, but that which is already released in such a way cannot be undone. Oracle and Innobase aren’t currently putting new effort into Embedded InnoDB, but a new project has spun up to move it forward. If the HailDB project produces improvements of value, then future versions of Innostore may switch to using that engine instead of using the original Embedded InnoDB release.

InnoDB is available under the GPL. Innostore, as a derivative work of Embedded InnoDB, is also available under the GPL. Neither Oracle nor Basho can take that away from you.


A Few More Details On Why We Switched To GitHub

November 11, 2010

We announced recently on the Riak Mailing List that Basho was switching to git and GitHub for development of Riak and all other Basho software. As stated in that linked email, we did this primarily for reasons pertaining to community involvement in the development of Riak. The explanation on the Mailing List was a bit terse, so we wanted to share some more details to ensure we answered all the questions related to the switch.

Some History

Riak was initially used as the underlying data store for an application Basho was selling several years ago and, at that time, its development was exclusively internal. The team used Mercurial for internal projects, so that was the de-facto DVCS choice for the source.

When we open-sourced Riak in August 2009, being Mercurial users, we chose to use BitBucket as our canonical repository. At the time we open-sourced it, we were less concerned with community involvement in the development process than we are now. Our primary reason for open-sourcing Riak was to get it into the hands of more developers faster.

Not long after this happened, the questions about why we weren’t on GitHub started to roll in. Our response was that we were a Mercurial shop and BitBucket was a natural extension of that. Sometime towards the beginning of May we started maintaining an official mirror of our code on GitHub. This mirror was our way of acknowledging that there is more than one way to develop software collaboratively and that we weren’t ignoring the heaps of developers who were dedicated GitHub users and preferred to look at and work with code on this platform.

Some Stats

GitHub has the concept of “Watchers” (analogous to “Followers” on BitBucket). We started accumulating Watchers once this GitHub mirror was in place. “Watchers” is a useful, but not absolute, metric for measuring interest and activity in a project. They bring a tremendous amount of attention to any given project through their use of the code and their promotion of it. They also, in the best case scenario, will enhance the code in a meaningful way by finding bugs and contributing patches.

This table shows the week on week of growth of BitBucket Followers vs. GitHub Watchers since we put the official mirror in place:

BitBucket GitHub
Number of Followers/Watchers at Time of Switch 97 145
Avg. Week on Week Growth (%) 0.74 7.2


Since putting the official mirror in place, the number of Watchers on the GitHub repo for Riak has grown at steady ready, averaging just over 7% week on week. This far outpaced the less than 1% growth in Followers on the canonical Bitbucket repository for Riak.

With this information it was clear that Riak on GitHub as a mirror was bringing us more attention and driving more community growth than was our canonical repo on BitBucket. So, in the interest of community development, we decided that Riak needed to live on GitHub. What they have built is truly the most collaborative and simple-to-use development platform there is (at least one well-respected software analyst has even called it “the future of open source”). Though Mercurial was deeply ingrained in our development process, we were willing to tolerate the workflow hiccups that arose during the week or so it took to get used to git in exchange for the resulting increase in attention and community contributions.

The switch is already proving fruitful. In addition to the sharp influx in Watchers for Riak, we’ve already taken some excellent code contributions via GitHub. That said, there is much left to be written. And we would love for you to join us in building something legendary in Riak, whatever your distributed version control system and platform preference may be.

So when you get a moment, go check out Riak on Github, or, if you prefer, Riak on BitBucket. And if you have any more questions, feel free to email: mark@basho.com.


Basho West and the Riak One Year Anniversary

July 19, 2010

Basho is growing. Fast. We are adding customers and users at a frenetic pace, and with this growth comes expansion in both team and locations. As some of you may have noticed, the Basho Team is not only becoming larger but more distributed. We now have people in six states scattered across four time zones pushing code and interacting with clients everyday.

First Order of Business

To bolster this growth and expansion, we did what any self-respecting tech startup would do: we opened an office in San Francisco. Several members of the Basho Team recently moved into a space at 795 Folsom, a cozy little spot a mere five floors below Twitter. (Proximity to the Nest was a requirement when evaluating office space.) We are calling it “Basho West.” There are four of us here, and we are settling in quite nicely.

If you are in the area and want to talk Riak, Basho, open source, coffee, etc., stop in and pay us a visit any time. Seriously. If you walk through the door of Suite 1028 with a Mac Book in hand and have a question about how to model your data in Riak, we’ll get out the whiteboard and help you out.

Second Order of Business

To make an immediate impact in the Bay Area, we thought it would be a great idea to get the first regularly scheduled Riak Meetup off the ground. We heard a rumor that there were a lot of people using or interested in databases out here, so we feel obliged to join the conversation. Here is the link to the San Francisco Riak Meetup group. If you’re in the Bay Area and want to meet with other like-minded developers and technologists to discuss Riak (and other database technologies) in every possible capacity, please join us.

Third Order of Business

Pop quiz: When did Basho Technologies open source Riak? We asked ourselves this the other day. As far we can tell, it was sometime during the first week and a half of August last year. “Huh,” we thought. “Wouldn’t it be great to have a little gathering to commemorate this event?” It sure would, so that’s what we are doing.

I mentioned above that we are starting a regularly scheduled Riak Meetup. To us, it made perfect sense to combine the inaugural Meetup with the event to celebrate Riak’s One Year Anniversary of being a completely open source technology.

The date of this gathering is Monday, August 9th. The exact time and location still needs to be solidified. We’ll be announcing that within the next few days. But put it on your calendar now, as you will not want to miss this. In addition to food, drink, and exceptional overall technical discussion and fireworks, here is what you can expect:

  • A talk from Dr. Eric Brewer, Basho Board Member and Father of the CAP Theorem
  • A few words from the team at Mochi Media about their experiences running Riak in production
  • A short talk from Basho’s VP of Engineering, Andy Gross, on the state of Riak and the near term road map

If you have any other suggestions about what you would like to see at this event, just leave us a message or an idea on the Meetup page linked above.

Let’s review:

  1. Come visit the new Basho Office at 795 Folsom, Suite 1028
  2. Join the Riak Meetup Group
  3. Come be a part of the Riak One Year Anniversary Celebration

And stay tuned, because things are only going to get more exciting from here.

The Basho Team

Berlin Buzzwords Day One Recap

June 8, 2010

This was originally posted by @rklophaus on his blog, rklophaus.comBerlinBuzzwords has a stellar venue and talks describing cutting edge developments on all things search, scalability, and storage. My recap of Day 1 is below.

Check back for a writeup on part 2.

Day 1 – June 7, 2010

The conference check-in was seamless, with much swag including messenger bags and notebooks. The venue itself–Kosmos Club–was amazing. Kosmos Club was the biggest movie house in East Germany before the fall of the wall, and has since turned into an event venue. Lots of metallics, varied textures, and swank chandeliers, with two of the biggest presentation rooms I’ve ever seen (they used to be movie theatres.)

Isabel Drost, Jan Lehnardt, and Simon Willnauer kept the opening remarks light and intimate. Overall, there were about 350 people in attendance.

Keynote: Grant Ingersoll from Lucid Imagination

Grant focused his talk around the words “Open”, “Scalable”, and “Intelligent”. He described how a number of things, such as big-data storage, search, and distributed computing have become commodities, but required a staff of Ph.D. level employees just a few years ago.

The main point of his talk is that openness and open-source, plus scalability, have turned these things into commodities, and that the next big, interesting thing to work on is data intelligence.

  • Data (produced per year, I believe) has grown from 161 exabytes in 2006 to ~1000 exabytes in 2010.
  • 85% of data we produce is unstructured, where unstructured may mean that we just aren’t yet smart enough to parse the data.

There are multiple levels to intelligence:

  • Level 1 – Finding, organizing, discovering and associating data
  • Level 2 – Collecting and personalizing data
  • Level 3 – Mining data for sentiment and semantics
  • Level 4 – Learning from data, extracting ideas
  • Level 5 – Reasoning about data

At this point in the talk, he switched to a description of Apache Mahout, a machine-learning engine. Mahout can do things such as recommendations, collaborative filtering, Bayesian analysis, Random Forests, discovery, and pattern matching. At some point, he mentioned things like Restricted Boltzmann Machines, Stachastic Gradient Dsecents, and Vector Machines as upcoming features.

The takeaway is that using Mahout, you can build open, scalable, intelligent apps right now. In practical terms, this means things like auto-suggest, auto-complete, content clustering, clickstream analysis, etc.

NoSQL: The Definitive Guide – Mathias Mayer

Mathias Meyer (Peritor Solutions, @roidrage) gave a very balanced view of the current state of NoSQL. Mathias gave history its proper respect, saying that much of what we view as “new” in NoSQL is actually older ideas in a prettier package. He mentioned things like Berkely DB (K/V store), Sybase (Column Store), Versant Object DB (Graph/Object Database), and Lotus Notes (Peer to Peer Document Database) as throwback examples.

Mathias said that relational databases tried to be a one-size-fits all solution, and that NoSQL is about removing constraints to speed up performance and development.

Mathias himself is a fan of CouchDB, Redis, and Riak, and wisely avoided giving specific recommendations on what projects someone should use. Each datastore handles different use cases, so use what is right for your data.

Mathias briefly touched on Voldemort, Tokyo, Redis, S3, Scalaris, Couhdb, Riak, Mongo, BigTable, Cassandra, HBase, HyperTable, Core Data, Neo4J, and HyperGraphDB. (In order of mention.) He then gave a slightly more in depth view of the replication and scaling models of both CouchDB and Dynamo.

One of the key things that NoSQL gets right, he said, is being constructed of open web technologies such as JSON, HTTP, links, and textual protocals.

What is hard for NoSQL now? Range queries, ad-hoc queries, and transactions, mainly because the NoSQL space is focused on scalability as a major goal.

One of the big points that I’m glad he brought up is, “As a developer, how do I know I’m not wasting my time on a NoSQL solution?” The key is that each of the different NoSQL projects was built to solve a real-world problem, so trust that somebody found it useful and needed it built.

His main point: NoSQL is not the Holy Grail. NoSQL should not be about replacing SQL. Instead, you need to be okay with having polyglot data storage. The data itself should dictate the datastore.

Making Software for Humans – CouchDB and the Usable Peer-To-Peer Web – Jan Lehnhardt

Jan Lehnardt (@janl, Couch.io), after a brief introduction of himself and Couch, led off with the statement that 80% of all NoSQL projects do the same thing as flat files. It’s the differences in the last 20% that really differentiate the projects. Therefore, NoSQL is about choice to build better systems. Each NoSQL project starts with a main idea.

According to Jan, CouchDB’s main idea is being ready to scale up, in that each node functions independently, and also ready to scale down, in that Couch is a great candidate for running on embedded mobile phones, routers, and other devices, as a way of synchronizing user content.

CouchDB’s synchronization is it’s killer feature. With Couch, you can subscribe to events like data updates that send out HTTP based notifications to other parts of your application. As an example, the Couch.io team has built a chat service based solely on writing Couch objects and receiving updates.

Jan wrapped up by touching on projects like Opera Unite as ahead-of-the-parade examples of what CouchDB is aiming to do, and is already doing, in projects like UbuntuOne–P2P data synchronization. He mentioned Facebook as an example of a centralized, closed web, Flickr as a centralized, but open service (since you can pull your data out), and Diaspora as the open web that follows the true vision of Tim Berners-Lee.

Riak from Small to Large – Rusty Klophaus

In my talk (Rusty Klophaus, Basho Technologies), I gave a brief description of how Riak differentiates itself in the NoSQL market. It was built first with the operations folks in mind, which makes sense given the Akamai background of the core developers, who understand and embrace the uncertainties of a distributed system.

I then described which features of Riak become important in single-node Riak clusters, three-node clusters, and ten-plus node clusters. Just like different features of your car are important going 50 m.p.h. vs. 100 m.p.h., different features of Riak are important at different cluster sizes.

In single node clusters, Riak provides a simple data model, with key/value access, a variety of client libraries (Python, PHP, Java, Javascript, Erlang), and configurable replication settings and backing datastores.

In small-to-medium sized clusters, Riak provides a way to take advantage of hardware in parallel, with Javascript-based Map/Reduce, well-behaved HTTP (allowing easy placement of proxies and caches), and Google Protobuffs support.

And in large-clusters, Riak provides an extremely easy operations story that can survive server outages and network partitions, and scale out by just running a few commands.

Finally, I ended with a 5-minute run through of how to use the Python client API to read/write an object, run a linkwalking operation, and run a Javascript-based map/reduce operation.

Slides are available on Slideshare.

Realtime Search with Lucene – Michel Busch

Michael Busch from Twitter discussed some upcoming changes to Lucene that allow it to search on data that has not yet been committed to disk. From my understanding, when Lucene commits a change it creates a segment and possibly merges multiple segments together. At that point, a reader can access the newly created segment.

For real-time search, the process needed to be shortened. The first attempt simply involved syncing out the changes when a reader was created. This didn’t work well. The next step was to actually search on the uncommitted index in memory. This was a challenge for a few reasons: first, Lucene uses multiple threads to update the index, so synchronizing those threads to provide the correct read-isolation is a problem. Second, the index maintains a large number of long lived objects in memory, and this causes inefficient garbage collection that kills performance.

Michael described a number of fixes that have already been written or are on their way, mostly around making multiple single-threaded index writers and changing the way postings are stored in memory which changes their structure from an unbounded number of objects to instead use a finite number of arrays. Effects on performance were amazing, especially for small memory sizes. A JVM with ~200 MB of RAM allocated was something like ~80% more performant.

A Twitter prototype with simultaneous indexes and queries showed that query performance, with the new modifications, is almost completely independent from query load, which is impressive. That said, the Twitter index has 32-bit postings (24 for DocID, 8 for term postition) so these results may not be the same for everyone.

ElasticSearch – Shay Banon

I unfortunately missed the first few minutes of Shay’s talk on ElasticSearch. ElasticSearch automates the partitioning, sharding, and replication of documents into Lucene indexes, and provides a unified interface for searches.

Shay described the JSON model that ElasticSearch uses for API access, which includes everything from queries and filters to creating new indexes.

The ElasticSearch distribution model works by posting a JSON document with the new index definition. ElasticSearch automatically balances the shards across available nodes, and it sounded like this is done in a node-aware way, so that replicas are stored on different nodes if possible. At index or query time, you can hit any node, and the node itself is responsible for directing the operation to the right place(s).

ElasticSearch supports per-document consistency (in other words, no commit/flush support.) The most interesting thing, to me, is that ElasticSearch embraces the idea of transient storage. In other words it was written with the intent of running on something like EC2, where you can’t trust local storage, and writing to remote storage is expensive, both computationally and monetarily.

To get around this potential bottleneck, ElasticSearch still supports reliable (or “somewhat reliable”, if you want to split hairs) persistence by assuming that not all replicas containing a node will fail at the same time. New documents are lazily logged to the backing store, and if the master node doing the logging dies, then one of its slaves will detect the death and finish the logging. When a new node starts up, it reads from the backing store.

The other interesting thing is that ElasticSearch is aware of different cloud providers (it sounded like EC2 and Rackspace Cloud) and consult the cloud provider itself to get a list of potential nodes, allowing it to self-assemble.

Key differences between ElasticSearch and Solr? Distributed model, different API, no facet support yet.

Nutch as a Web Mining Platform – Andrzej Bialecki

Andrzej Bialecki (SIGRAM) described Nutch, a distributed web-crawler/search engine built on top of Lucene. It provides the standard framework of a distributed search engine that is extensible by plug-ins, and handles things such as URL filtering, normalizing, depth vs. breadth first crawling, etc.

The presentation was eye-opening just to see all of the things that make web-crawling difficult that are NOT about storing data and serving up queries. The Nutch team has spent a large amount of time going into things like what happens when you encounter auto-generated sites, buggy web-servers, link-spammers, or other tar-pits during a crawl.

Some common techniques for “bootstrapping” a web crawler are to start with some high quality seed sites, which may be well-known, authoritative resources, reference sites, or even the top-N results from an existing search engine like Google.

Once you have your search data, Andrzej described ways to mine the data, such as using keyword, phrase, or anchor search, using facets to find latent topics, using top-N results to prioritize future crawling, mining incoming links, treating the web as a corpus of sample textual data, associating concepts, uncoving gossip, opinions, and sentiments from data.

Nutch is currently under a redesign, attempting to share more code with common crawler libraries and other projects. Part of this will be converting data storage to a well-defined layer, allowing for pluggable backends so that users can take advantage of native data tools for those backends.

Riak Search – Rusty Klophaus

In my second talk, I discussed Riak Search, a distributed indexing and full-text search engine built on (and complementary to) Riak.

Part one covered the main reason for building Riak search: clients have built applications that eventually need to find data by value, not just by key. This is difficult, if not impossible, in a key/value store.

Part two described the shape of the final solution we set out to create. The goal of Riak Search is to support the Lucene interface, with Lucene syntax support and Solr endpoints, but with the operations story of Riak. This means that Riak Search will scale easily by adding new machines, and will continue to run after machine failure.

Part three was an introduction to Inverted Indexing, which is the heart of all search systems, as well as the difference between Document-Partitioning and Term-Partitioning, which forms the ongoing battle in the distributed search field.

The tradeoffs are that Document-Partitioning generally has lower latency queries, but lower overall throughput due to it requiring a disk-seek on each partition. For this reason, Riak Search uses Term-Based partitioning, with some special optimizations using term-splitting, bloom filters, and result batching.

Slides available soon.


Talks I Wished I Had Attended

The conference schedule today had two tracks, Search and NoSQL, plus I presented two talks, so there were a number of talks I was not able to attend. I would have liked to see the talks below, and look forward to the conference video:

  • Finite-State Queries in Lucene – Robert Muir
  • Text and metadata extraction with Apache Tika – Jukka Zitting
  • MetaCarta GeoSearch Toolkit for Solr – James Goodwin
  • The return of the Hierarchical Model – Jukka Zitting
  • Five cool problems you can solve with Neo4J – Peter Neubauer

Riak in Production – A Distributed Event Registration System Written in Erlang

March 20, 2010

Riak, at its core, is an open source project. So, we love the opportunity to hear from our users and find out where and how they are using Riak in their applications. It is for that reason that we were excited to hear from Chris Villalobos. He recently put a Distributed Event Registration application into production at his church in Gainesville, Florida, and after hearing a bit about it, we asked him to write a short piece about it for the Basho Blog.

Use Case and Prototyping

As a way of going paperless at our church, I was tasked with creating an event registration system that was accessible via touchscreen kiosk, SMS, and our website, to be used by members to sign up for various events. As I was wanting to learn a new language and had dabbled in Erlang (specifically Mochiweb) for another small application, I decided that I was going to try and do the whole thing in Erlang. But how to do it, and on a two month time line, was quite the challenge.

The initial idea was to have each kiosk independently hold pieces of the database, so that in the event something happened to a server or a kiosk, the data would still be available. Also, I wanted to use the fault-tolerance and distributed processing of Erlang to help make sure that the various frontends would be constantly running and online. And, as I wanted to stay as close to pure Erlang as possible, I decided early against a SQL database. I tried Mnesia but I wasn’t happy with the results. Using QLC as an interface, interesting issues arose when I took down a master node. (I was also facing a time issue so playing with it extensively wasn’t really an option.)

It just so happened that Basho released Riak 0.8 the morning I got fed up with it. So I thought about how I could use a key/value store. I liked how the Riak API made it simple to get data in and out of the database, how I could use map-reduce functionality to create any reports I needed and how the distribution of data worked out. Most importantly, no matter what nodes I knocked out while the cluster was running, everything just continued to click. I found my datastore.

During the initial protoyping stages for the kiosk, I envisioned a simple key/value store using a data model that looked something like this:

{key1, {Title, Icon, Background Image, Description, [signup_options]}},
{key2, {…}}

This design would enable me to present the user with a list of options when the kiosk was started up. I found that by using Riak, this was simple to implement. I also enjoyed that Riak was great at getting out of the way. I didn’t have to think about how it was going to work, I just knew that it would. ( The primary issue I kept running into when I thought about future problems was sibling entries. If two users on two kiosks submit information at the same time for the same entry, (potentially an issue as the number of kiosks grow), then that would result in sibling entries because of the way user data is stored:

<>, <>, [user data]

But, by checking for siblings when the reports are generated, this problem became a non-issue.)

High Level Architecture

The kiosk is live and running now with very few kinks (mostly hardware) and everything is in pure Erlang. At a high level, the application architecture looks like this:

Each Touchscreen Kiosk:

  • wxErlang
  • Riak node

Web-Based Management/SMS Processing Layer:

  • Nitrogen Framework speaking to Riak for Kiosk Configuration/Reporting
  • Nitrogen/Mochiweb processing SMS messages from SMS aggregator

Periodic Email Sender:

  • Vagabond’s gen_smtp client on a eternal receive after 24 hours send email-loop.

In Production

Currently, we are running four Riak nodes (writing out to the Filesystem backend) outside of the three Kiosks themselves. I also have various Riak nodes on my random linux servers because I can use the CPU cycles on my other nodes to distribute MapReduce functions and store information in a redundant fashion.

By using Riak, I was able to keep the database lean and mean with creative uses of keys. Every asset for the kiosk is stored within Riak, including images. These are pulled only whenever a kiosk is started up or whenever an asset is created, updated, or removed (using message passing). If an image isn’t present on a local kiosk, it is pulled from the database and then stored locally. Also, all images and panels (such as the on-screen keyboard) are stored in memory to make things faster.

All SMS messages are stored within an SMS bucket. Every 24 hours all the buckets are checked with a “mapred_bucket” to see if there are any new messages since the last time the function ran. These results are formatted within the MapReduce function and emailed out using the gen_smtp client. As assets are removed from the system, the current data is stored within a serialized text file and then removed the database.

As I bring more kiosks into operation, the distributed map-reduce feature is becoming more valuable. Since I typically run reports during off hours, the kiosks aren’t overloaded by the extra processing power. So far I have been able to roll out a new kiosk within 2 hours of receiving the hardware. Most of this time is spent doing the installation and configuration of the touchscreen. Also, the system is becoming more and more vital to how we are interfacing with people, giving members multiple ways of contacting us at their convenience. I am planning on expanding how I use the system, especially with code-distribution. For example, with the Innostore interface, I might store the beam files inside and send them to the kiosks using Erlang commands. (Version Control inside Riak, anyone?)

What’s Next?

I have ambitious plans for the system, especially on the kiosk side. As this is a very beta version of the software, it is only currently in production in our little community. That said, I hope to open source it and put it on github/bitbucket/etc. as soon as I pretty up all the interfaces.

I’d say probably the best thing about this whole project is getting to know the people inside the Erlang community, especially the Basho people and the #erlang regulars on IRC. Anytime I had a problem, someone was there willing to work through it with me. Since I am essentially new to Erlang, it really helped to have a strong sense of community. Thank you to all the folks at Basho for giving me a platform to show what Erlang can do in everyday, out of the way places.

Chris Villalobos

Basho Podcast Three – An Introduction To Innostore

February 2, 2010

You may remember that Basho recently open-sourced Innostore, our standalone Erlang application that provides a simple interface to embedded InnoDB…

In this podcast, Dave “Dizzy” Smith and Justin Sheehy discuss the release of Innostore, why we built it, how we use it in Riak, and why it might be useful for other Erlang projects. The discussion focuses on the stability and predictability of InnoDB, especially under load and as compared with other storage backends like DETS.

And of course, go download Innostore when you are done with the podcast.



If you are having problems getting the podcast to play, click here to play in new window or right click to download the podcast.