June 8, 2010
This was originally posted by @rklophaus on his blog, rklophaus.com. BerlinBuzzwords has a stellar venue and talks describing cutting edge developments on all things search, scalability, and storage. My recap of Day 1 is below.
Check back for a writeup on part 2.
Day 1 – June 7, 2010
The conference check-in was seamless, with much swag including messenger bags and notebooks. The venue itself–Kosmos Club–was amazing. Kosmos Club was the biggest movie house in East Germany before the fall of the wall, and has since turned into an event venue. Lots of metallics, varied textures, and swank chandeliers, with two of the biggest presentation rooms I’ve ever seen (they used to be movie theatres.)
Isabel Drost, Jan Lehnardt, and Simon Willnauer kept the opening remarks light and intimate. Overall, there were about 350 people in attendance.
Keynote: Grant Ingersoll from Lucid Imagination
Grant focused his talk around the words “Open”, “Scalable”, and “Intelligent”. He described how a number of things, such as big-data storage, search, and distributed computing have become commodities, but required a staff of Ph.D. level employees just a few years ago.
The main point of his talk is that openness and open-source, plus scalability, have turned these things into commodities, and that the next big, interesting thing to work on is data intelligence.
- Data (produced per year, I believe) has grown from 161 exabytes in 2006 to ~1000 exabytes in 2010.
- 85% of data we produce is unstructured, where unstructured may mean that we just aren’t yet smart enough to parse the data.
There are multiple levels to intelligence:
- Level 1 – Finding, organizing, discovering and associating data
- Level 2 – Collecting and personalizing data
- Level 3 – Mining data for sentiment and semantics
- Level 4 – Learning from data, extracting ideas
- Level 5 – Reasoning about data
At this point in the talk, he switched to a description of Apache Mahout, a machine-learning engine. Mahout can do things such as recommendations, collaborative filtering, Bayesian analysis, Random Forests, discovery, and pattern matching. At some point, he mentioned things like Restricted Boltzmann Machines, Stachastic Gradient Dsecents, and Vector Machines as upcoming features.
The takeaway is that using Mahout, you can build open, scalable, intelligent apps right now. In practical terms, this means things like auto-suggest, auto-complete, content clustering, clickstream analysis, etc.
NoSQL: The Definitive Guide – Mathias Mayer
Mathias Meyer (Peritor Solutions, @roidrage) gave a very balanced view of the current state of NoSQL. Mathias gave history its proper respect, saying that much of what we view as “new” in NoSQL is actually older ideas in a prettier package. He mentioned things like Berkely DB (K/V store), Sybase (Column Store), Versant Object DB (Graph/Object Database), and Lotus Notes (Peer to Peer Document Database) as throwback examples.
Mathias said that relational databases tried to be a one-size-fits all solution, and that NoSQL is about removing constraints to speed up performance and development.
Mathias himself is a fan of CouchDB, Redis, and Riak, and wisely avoided giving specific recommendations on what projects someone should use. Each datastore handles different use cases, so use what is right for your data.
Mathias briefly touched on Voldemort, Tokyo, Redis, S3, Scalaris, Couhdb, Riak, Mongo, BigTable, Cassandra, HBase, HyperTable, Core Data, Neo4J, and HyperGraphDB. (In order of mention.) He then gave a slightly more in depth view of the replication and scaling models of both CouchDB and Dynamo.
One of the key things that NoSQL gets right, he said, is being constructed of open web technologies such as JSON, HTTP, links, and textual protocals.
What is hard for NoSQL now? Range queries, ad-hoc queries, and transactions, mainly because the NoSQL space is focused on scalability as a major goal.
One of the big points that I’m glad he brought up is, “As a developer, how do I know I’m not wasting my time on a NoSQL solution?” The key is that each of the different NoSQL projects was built to solve a real-world problem, so trust that somebody found it useful and needed it built.
His main point: NoSQL is not the Holy Grail. NoSQL should not be about replacing SQL. Instead, you need to be okay with having polyglot data storage. The data itself should dictate the datastore.
Making Software for Humans – CouchDB and the Usable Peer-To-Peer Web – Jan Lehnhardt
Jan Lehnardt (@janl, Couch.io), after a brief introduction of himself and Couch, led off with the statement that 80% of all NoSQL projects do the same thing as flat files. It’s the differences in the last 20% that really differentiate the projects. Therefore, NoSQL is about choice to build better systems. Each NoSQL project starts with a main idea.
According to Jan, CouchDB’s main idea is being ready to scale up, in that each node functions independently, and also ready to scale down, in that Couch is a great candidate for running on embedded mobile phones, routers, and other devices, as a way of synchronizing user content.
CouchDB’s synchronization is it’s killer feature. With Couch, you can subscribe to events like data updates that send out HTTP based notifications to other parts of your application. As an example, the Couch.io team has built a chat service based solely on writing Couch objects and receiving updates.
Jan wrapped up by touching on projects like Opera Unite as ahead-of-the-parade examples of what CouchDB is aiming to do, and is already doing, in projects like UbuntuOne–P2P data synchronization. He mentioned Facebook as an example of a centralized, closed web, Flickr as a centralized, but open service (since you can pull your data out), and Diaspora as the open web that follows the true vision of Tim Berners-Lee.
Riak from Small to Large – Rusty Klophaus
In my talk (Rusty Klophaus, Basho Technologies), I gave a brief description of how Riak differentiates itself in the NoSQL market. It was built first with the operations folks in mind, which makes sense given the Akamai background of the core developers, who understand and embrace the uncertainties of a distributed system.
I then described which features of Riak become important in single-node Riak clusters, three-node clusters, and ten-plus node clusters. Just like different features of your car are important going 50 m.p.h. vs. 100 m.p.h., different features of Riak are important at different cluster sizes.
And in large-clusters, Riak provides an extremely easy operations story that can survive server outages and network partitions, and scale out by just running a few commands.
Slides are available on Slideshare.
Realtime Search with Lucene – Michel Busch
Michael Busch from Twitter discussed some upcoming changes to Lucene that allow it to search on data that has not yet been committed to disk. From my understanding, when Lucene commits a change it creates a segment and possibly merges multiple segments together. At that point, a reader can access the newly created segment.
For real-time search, the process needed to be shortened. The first attempt simply involved syncing out the changes when a reader was created. This didn’t work well. The next step was to actually search on the uncommitted index in memory. This was a challenge for a few reasons: first, Lucene uses multiple threads to update the index, so synchronizing those threads to provide the correct read-isolation is a problem. Second, the index maintains a large number of long lived objects in memory, and this causes inefficient garbage collection that kills performance.
Michael described a number of fixes that have already been written or are on their way, mostly around making multiple single-threaded index writers and changing the way postings are stored in memory which changes their structure from an unbounded number of objects to instead use a finite number of arrays. Effects on performance were amazing, especially for small memory sizes. A JVM with ~200 MB of RAM allocated was something like ~80% more performant.
A Twitter prototype with simultaneous indexes and queries showed that query performance, with the new modifications, is almost completely independent from query load, which is impressive. That said, the Twitter index has 32-bit postings (24 for DocID, 8 for term postition) so these results may not be the same for everyone.
ElasticSearch – Shay Banon
I unfortunately missed the first few minutes of Shay’s talk on ElasticSearch. ElasticSearch automates the partitioning, sharding, and replication of documents into Lucene indexes, and provides a unified interface for searches.
Shay described the JSON model that ElasticSearch uses for API access, which includes everything from queries and filters to creating new indexes.
The ElasticSearch distribution model works by posting a JSON document with the new index definition. ElasticSearch automatically balances the shards across available nodes, and it sounded like this is done in a node-aware way, so that replicas are stored on different nodes if possible. At index or query time, you can hit any node, and the node itself is responsible for directing the operation to the right place(s).
ElasticSearch supports per-document consistency (in other words, no commit/flush support.) The most interesting thing, to me, is that ElasticSearch embraces the idea of transient storage. In other words it was written with the intent of running on something like EC2, where you can’t trust local storage, and writing to remote storage is expensive, both computationally and monetarily.
To get around this potential bottleneck, ElasticSearch still supports reliable (or “somewhat reliable”, if you want to split hairs) persistence by assuming that not all replicas containing a node will fail at the same time. New documents are lazily logged to the backing store, and if the master node doing the logging dies, then one of its slaves will detect the death and finish the logging. When a new node starts up, it reads from the backing store.
The other interesting thing is that ElasticSearch is aware of different cloud providers (it sounded like EC2 and Rackspace Cloud) and consult the cloud provider itself to get a list of potential nodes, allowing it to self-assemble.
Key differences between ElasticSearch and Solr? Distributed model, different API, no facet support yet.
Nutch as a Web Mining Platform – Andrzej Bialecki
Andrzej Bialecki (SIGRAM) described Nutch, a distributed web-crawler/search engine built on top of Lucene. It provides the standard framework of a distributed search engine that is extensible by plug-ins, and handles things such as URL filtering, normalizing, depth vs. breadth first crawling, etc.
The presentation was eye-opening just to see all of the things that make web-crawling difficult that are NOT about storing data and serving up queries. The Nutch team has spent a large amount of time going into things like what happens when you encounter auto-generated sites, buggy web-servers, link-spammers, or other tar-pits during a crawl.
Some common techniques for “bootstrapping” a web crawler are to start with some high quality seed sites, which may be well-known, authoritative resources, reference sites, or even the top-N results from an existing search engine like Google.
Once you have your search data, Andrzej described ways to mine the data, such as using keyword, phrase, or anchor search, using facets to find latent topics, using top-N results to prioritize future crawling, mining incoming links, treating the web as a corpus of sample textual data, associating concepts, uncoving gossip, opinions, and sentiments from data.
Nutch is currently under a redesign, attempting to share more code with common crawler libraries and other projects. Part of this will be converting data storage to a well-defined layer, allowing for pluggable backends so that users can take advantage of native data tools for those backends.
Riak Search – Rusty Klophaus
In my second talk, I discussed Riak Search, a distributed indexing and full-text search engine built on (and complementary to) Riak.
Part one covered the main reason for building Riak search: clients have built applications that eventually need to find data by value, not just by key. This is difficult, if not impossible, in a key/value store.
Part two described the shape of the final solution we set out to create. The goal of Riak Search is to support the Lucene interface, with Lucene syntax support and Solr endpoints, but with the operations story of Riak. This means that Riak Search will scale easily by adding new machines, and will continue to run after machine failure.
Part three was an introduction to Inverted Indexing, which is the heart of all search systems, as well as the difference between Document-Partitioning and Term-Partitioning, which forms the ongoing battle in the distributed search field.
The tradeoffs are that Document-Partitioning generally has lower latency queries, but lower overall throughput due to it requiring a disk-seek on each partition. For this reason, Riak Search uses Term-Based partitioning, with some special optimizations using term-splitting, bloom filters, and result batching.
Slides available soon.
Talks I Wished I Had Attended
The conference schedule today had two tracks, Search and NoSQL, plus I presented two talks, so there were a number of talks I was not able to attend. I would have liked to see the talks below, and look forward to the conference video:
- Finite-State Queries in Lucene – Robert Muir
- Text and metadata extraction with Apache Tika – Jukka Zitting
- MetaCarta GeoSearch Toolkit for Solr – James Goodwin
- The return of the Hierarchical Model – Jukka Zitting
- Five cool problems you can solve with Neo4J – Peter Neubauer