June 9, 2010
Day 2 – June 8, 2010
Keynote – Pieter Hintjens
Pieter (iMatix) started his talk with a series of high-level questions, developer-to-developer, intended to focus the audience on the fact that multi-core processing across multiple computers is the new norm, and (most) programming tools haven’t yet evolved to meet the challenge.
He then identified and discussed some of the natural patterns in software development that make things simpler. After a few examples relating to the NoSQL world, he identified three that led into his introduction of 0MQ (Pronounced Zero-M-Q):
- Asynchronous processing is a natural pattern, single threads that read from a queue, do work, and write to another queue.
- Choosing reliability over persistence is a natural pattern. If you don’t crash, you don’t need to worry about persisting data. (Not sure how I feel about this one. Just gonna roll with it for the sake of the talk.)
- Being agnostic to lines of communication is a natural pattern, what you send is orthogonal to how you send it.
Pieter then introduced the 0MQ library, which attempts to be a simple and lightweight message queue following these natural patterns. It takes care of defining queue endpoints, connecting (and re-connecting) the endpoints, buffering messages in memory, and not much else. The data format is simply a length and a binary blob, that’s all.
According to Pieter, 0MQ should be thought of as a protocol, just like TCP or UDP. In other words, 0MQ is the sort of thing you embed in your database application.
With 0MQ, you can safely create multi-threaded applications that safely leverage multiple cores by making each worker process single threaded, and have it read from a queue, perform some unit of work, and write to another queue. (In other words, the Actor model, concurrency by message passing.) The idea is not new, but it bears repeating as often as possible because it’s far simpler than multithreaded systems with locking, 99% of the time it’s the right solution, and many people still don’t know it.
Pieter had two choice quotes that drove home the main goals of 0MQ:
- “If a developer can’t pick it up in a weekend, it’s not going to work.”
- “Cheap effective messaging changes the way we think about architecting applications.”
Hypertable – Doug Judd
I unfortunately missed the first few minutes of Doug’s talk. When I arrived, Doug (Hypertable) was in the midst of an architectural overview of the Google stack, BigTable architecture, the ideas behind a log-structured merge tree, and examples of Hypertable optimizations, including bloom filters and using different compression algorithms in different parts of the system.
The money slides came toward the end, with performance comparisions claiming Hypertable to be 70% faster than HBase on random reads and sequential writes. Another chart claimed Hypertable to be multiple times faster than HBase when doing only random reads of different distributions.
Apart from the previously mentioned optimizations, there seem to be two main reasons for Hypertable’s speed: It’s based in C vs. HBase’s Java, and it is smart enough to dynamically adjust memory between caching reads and buffering writes according to the read/write distribution of the data.
And yes, Hypertable works with Hadoop for Map/Reduce-ing goodness…
Apache Cassandra Revisited – Eric Evans
Eric quickly focusing the talk by narrowing from All-of-NoSQL to Just-the-Large-Data-Projects, and then finally Just-the-BigTable-or-Dynamo-projects, which means Cassandra, HBase, Hypertable, Riak, and Voldemort.
Given these criteria, Eric called Cassandra the love-child of BigTable AND Dynamo, having influences from both. As such it has Dynamo staples like homogonous nodes, P2P-routing and partitioning (though not VNodes), and things like SSTables and (optionally) ordered data and range queries, similar to BigTable. (His slides contained a humorous, yet distrurbing picture showing a Brad Pitt/Angelina Jolie mutant child.)
Eric described the bootstrap process, the Cassandra data model (Keyspace, Column Family, Record, Column), and the interface (Thrift), and showed API examples.
He then highlighted a few key Cassandra developments and features:
- Cassandra does now support batch Map/Reduce via Hadoop.
- Cassandra comes with rack awareness, and this can be customized.
- Keyspaces and Column families, currently defined in XML, will soon be configurable via an API without a restart.
- Vector clocks will be added in the future. (But in his view, the hype outweights the benefit.)
- SuperColumns may be phased out in the future, as they are not widely used, and lead to more confusion than they are worth.
According to Eric, the largest Cassandra instance that he knows of is Twitter, with around 100 nodes holding about 170TB of data.
Massively Parallel Analytics Beyond Map/Reduce – Fabian Huske
Fabian (TU Berlin) began by describing some of the challenges behind Map/Reduce, namely that it does make big data processing more simple than it used to be, but it still requires a developer to fit his problem into something Map/Reduce shaped, and this is exacerbated by the complexities of the various Map/Reduce frameworks out there.
Fabian then introduced Stratosphere, which is a combination programming model (PACT) and execution engine (Nephele) that provides additional blocks beyond Map and Reduce that can be used instead of a simple Map or Reduce, with the dual goals of making it easier to program as well as require fewer execution phases leading to higher performance. Stratosphere is a result of combining Map/Reduce with parallel database technology.
As an example, Fabian showed a SQL task that could be converted to two Map/Reduce jobs that with Stratosphere could be made simpler using Stratosphere.
A few examples: with PACT, you have new second-order functions in which to put your user code such as operations for “cross” (compute a cartesian cross-product of inputs), “match” (compute only where input keys from both sources match), and “cogroup” (missed this one). Building more complex second-order functions allows for less user code.
Next steps for the projcet are more input contracts, flexible checkpointing and recovery, and robust and adaptive execution, with a goal of going open-source by the end of 2010.
Sqoop – Database Import and Export for Hadoop – Aaron Kimball
Aaron (Cloudera) set the stage with a quick run-down of the limits of the SQL world, and the plusses and minuses of Hadoop, which lead to the introduction of Sqoop (SQL in Hadoop).
Sqoop provides a suite of tools to connect Hadoop to a JDBC-compliant SQL database, extract data and schema information, import the data into Hadoop, auto-generate code to parse the data, and export any results back into the SQL database.
The goal is to make it easier to pull SQL-hosted data into your Hadoop-cluster for the purpose of having the data available while doing other processing. For example, clickstream data might be in Hadoop, while profile information is in SQL. With Sqoop, you can get the data into Hadoop in an efficient way to support analysis. Copying the data from SQL in one operation is better than repeatedly hitting the database while running analysis because a big Hadoop cluster can easily hose a SQL machine.
Sqoop has some complexity under the hood:
- It can export data definitions to Hive (see writeup next.)
- It reads from/write to SQL in parellel.
- You can use a SELECT/WHERE query to get data, which allows you to run Sqoop incrementally, fetching new data since the last run.
- Supports mysqldump and mysqlexport.
Hive: SQL for Hadoop – Sarah Sproehnle
Sarah (Cloudera) described Hive, a parser, optimizer, compiler, and shell for transforming SQL-like queries into Map/Reduce. With Hive, you think of your data as being in tables rather than files, so you create tables, load data from a local file or Hadoop file into the table, and can then run SQL-like queries.
(I used the word “SQL-like” above, but Hive queries are actually standards compliant SQL, with just a few limitations/twists. Anyone who knows SQL at any level can pick up the changes in just a few minutes.)
In other words, with Hive you can:
- CREATE and DESCRIBE tables, and ADD and DROP columns on tables.
- EXPLAIN queries.
- Query Hadoop data using SELECT, TOP, FROM, JOIN, WHERE, GROUP BY, and ORDER BY.
- Write Hadoop data using INSERT (though the insert actually means “clobber the old data and replace it with this new data”)
- In a twist on standard SQL, run a multi-table INSERT, where you split a single SELECT/FROM into multiple output streams, allowing you to write different columns to different tables. You can also, further filter, group, or transform the data in each stream independently before writing to the final table.
- Run data through a custom shell script, expecting lines of data on stdin, results on stdout.
- Partition and bucket data, allowing for easy ways to drop a subset of data, or take a sampling of data.
Hive gives you the convenience of SQL, but at the end of the day it’s still running as a Map/Reduce job on Hadoop, which means:
- No transactions.
- Latencies measured in minutes, not milliseconds.
- No indexes, think of everything as a full-table scan.
Not surprising, and not bad considering you can run a SQL query across Petabytes of data.
The Hive install is installed on the client, so you don’t need to do anything to the Hadoop cluster to run it. Hive keeps schema information in a Metastore, which can be kept on the local machine without any special configuration, or shared in a central repository allowing multiple users to share Hive table definitions. The schema is verified at data read time, not when the schema is created. Again, this makes sense given Hadoop’s execution model.
1,000 points to Sarah for running a live demo during the presentation. Gutsy, but always a crowd pleaser.
Talks I Wished I Had Attended
The conference schedule today had two tracks, so there were a number of talks I was not able to attend. I would have liked to see the talks below, and look forward to the conference video:
- Hadoop: An Industry Perspective – Aaron Kimball
- Behemith: A Hadoop-based platform for large scale document processing – Julian Nioche
- Introduction to Collaborative Filtering using Mahout – Frank Scholten
Isabel Drost, Jan Lehnardt, and Simon Willnauer kept the wrap-up short, thanking the other organizers, the tech staff (who gave a quick, fun recap of network usage), the venue, the presenters, and the audience.
When Jan asked who wanted to go to BerlinBuzzwords 2011 next year, every hand in the room shot up.
BerlinBuzzwords was an amazing conference. Half of the credit goes to the organizers for picking a great venue and interesting presenters. The other half goes to the largely German/European audience, who, 99% of the time, were focused on the presentation with laptops closed and (often) paper notepads open. This level of engagement lead to great questions from the audience after each presentation, and lots of hallway interaction. Sign me up for next year!
June 8, 2010
This was originally posted by @rklophaus on his blog, rklophaus.com. BerlinBuzzwords has a stellar venue and talks describing cutting edge developments on all things search, scalability, and storage. My recap of Day 1 is below.
Check back for a writeup on part 2.
Day 1 – June 7, 2010
The conference check-in was seamless, with much swag including messenger bags and notebooks. The venue itself–Kosmos Club–was amazing. Kosmos Club was the biggest movie house in East Germany before the fall of the wall, and has since turned into an event venue. Lots of metallics, varied textures, and swank chandeliers, with two of the biggest presentation rooms I’ve ever seen (they used to be movie theatres.)
Isabel Drost, Jan Lehnardt, and Simon Willnauer kept the opening remarks light and intimate. Overall, there were about 350 people in attendance.
Keynote: Grant Ingersoll from Lucid Imagination
Grant focused his talk around the words “Open”, “Scalable”, and “Intelligent”. He described how a number of things, such as big-data storage, search, and distributed computing have become commodities, but required a staff of Ph.D. level employees just a few years ago.
The main point of his talk is that openness and open-source, plus scalability, have turned these things into commodities, and that the next big, interesting thing to work on is data intelligence.
- Data (produced per year, I believe) has grown from 161 exabytes in 2006 to ~1000 exabytes in 2010.
- 85% of data we produce is unstructured, where unstructured may mean that we just aren’t yet smart enough to parse the data.
There are multiple levels to intelligence:
- Level 1 – Finding, organizing, discovering and associating data
- Level 2 – Collecting and personalizing data
- Level 3 – Mining data for sentiment and semantics
- Level 4 – Learning from data, extracting ideas
- Level 5 – Reasoning about data
At this point in the talk, he switched to a description of Apache Mahout, a machine-learning engine. Mahout can do things such as recommendations, collaborative filtering, Bayesian analysis, Random Forests, discovery, and pattern matching. At some point, he mentioned things like Restricted Boltzmann Machines, Stachastic Gradient Dsecents, and Vector Machines as upcoming features.
The takeaway is that using Mahout, you can build open, scalable, intelligent apps right now. In practical terms, this means things like auto-suggest, auto-complete, content clustering, clickstream analysis, etc.
NoSQL: The Definitive Guide – Mathias Mayer
Mathias Meyer (Peritor Solutions, @roidrage) gave a very balanced view of the current state of NoSQL. Mathias gave history its proper respect, saying that much of what we view as “new” in NoSQL is actually older ideas in a prettier package. He mentioned things like Berkely DB (K/V store), Sybase (Column Store), Versant Object DB (Graph/Object Database), and Lotus Notes (Peer to Peer Document Database) as throwback examples.
Mathias said that relational databases tried to be a one-size-fits all solution, and that NoSQL is about removing constraints to speed up performance and development.
Mathias himself is a fan of CouchDB, Redis, and Riak, and wisely avoided giving specific recommendations on what projects someone should use. Each datastore handles different use cases, so use what is right for your data.
Mathias briefly touched on Voldemort, Tokyo, Redis, S3, Scalaris, Couhdb, Riak, Mongo, BigTable, Cassandra, HBase, HyperTable, Core Data, Neo4J, and HyperGraphDB. (In order of mention.) He then gave a slightly more in depth view of the replication and scaling models of both CouchDB and Dynamo.
One of the key things that NoSQL gets right, he said, is being constructed of open web technologies such as JSON, HTTP, links, and textual protocals.
What is hard for NoSQL now? Range queries, ad-hoc queries, and transactions, mainly because the NoSQL space is focused on scalability as a major goal.
One of the big points that I’m glad he brought up is, “As a developer, how do I know I’m not wasting my time on a NoSQL solution?” The key is that each of the different NoSQL projects was built to solve a real-world problem, so trust that somebody found it useful and needed it built.
His main point: NoSQL is not the Holy Grail. NoSQL should not be about replacing SQL. Instead, you need to be okay with having polyglot data storage. The data itself should dictate the datastore.
Making Software for Humans – CouchDB and the Usable Peer-To-Peer Web – Jan Lehnhardt
Jan Lehnardt (@janl, Couch.io), after a brief introduction of himself and Couch, led off with the statement that 80% of all NoSQL projects do the same thing as flat files. It’s the differences in the last 20% that really differentiate the projects. Therefore, NoSQL is about choice to build better systems. Each NoSQL project starts with a main idea.
According to Jan, CouchDB’s main idea is being ready to scale up, in that each node functions independently, and also ready to scale down, in that Couch is a great candidate for running on embedded mobile phones, routers, and other devices, as a way of synchronizing user content.
CouchDB’s synchronization is it’s killer feature. With Couch, you can subscribe to events like data updates that send out HTTP based notifications to other parts of your application. As an example, the Couch.io team has built a chat service based solely on writing Couch objects and receiving updates.
Jan wrapped up by touching on projects like Opera Unite as ahead-of-the-parade examples of what CouchDB is aiming to do, and is already doing, in projects like UbuntuOne–P2P data synchronization. He mentioned Facebook as an example of a centralized, closed web, Flickr as a centralized, but open service (since you can pull your data out), and Diaspora as the open web that follows the true vision of Tim Berners-Lee.
Riak from Small to Large – Rusty Klophaus
In my talk (Rusty Klophaus, Basho Technologies), I gave a brief description of how Riak differentiates itself in the NoSQL market. It was built first with the operations folks in mind, which makes sense given the Akamai background of the core developers, who understand and embrace the uncertainties of a distributed system.
I then described which features of Riak become important in single-node Riak clusters, three-node clusters, and ten-plus node clusters. Just like different features of your car are important going 50 m.p.h. vs. 100 m.p.h., different features of Riak are important at different cluster sizes.
And in large-clusters, Riak provides an extremely easy operations story that can survive server outages and network partitions, and scale out by just running a few commands.
Slides are available on Slideshare.
Realtime Search with Lucene – Michel Busch
Michael Busch from Twitter discussed some upcoming changes to Lucene that allow it to search on data that has not yet been committed to disk. From my understanding, when Lucene commits a change it creates a segment and possibly merges multiple segments together. At that point, a reader can access the newly created segment.
For real-time search, the process needed to be shortened. The first attempt simply involved syncing out the changes when a reader was created. This didn’t work well. The next step was to actually search on the uncommitted index in memory. This was a challenge for a few reasons: first, Lucene uses multiple threads to update the index, so synchronizing those threads to provide the correct read-isolation is a problem. Second, the index maintains a large number of long lived objects in memory, and this causes inefficient garbage collection that kills performance.
Michael described a number of fixes that have already been written or are on their way, mostly around making multiple single-threaded index writers and changing the way postings are stored in memory which changes their structure from an unbounded number of objects to instead use a finite number of arrays. Effects on performance were amazing, especially for small memory sizes. A JVM with ~200 MB of RAM allocated was something like ~80% more performant.
A Twitter prototype with simultaneous indexes and queries showed that query performance, with the new modifications, is almost completely independent from query load, which is impressive. That said, the Twitter index has 32-bit postings (24 for DocID, 8 for term postition) so these results may not be the same for everyone.
ElasticSearch – Shay Banon
I unfortunately missed the first few minutes of Shay’s talk on ElasticSearch. ElasticSearch automates the partitioning, sharding, and replication of documents into Lucene indexes, and provides a unified interface for searches.
Shay described the JSON model that ElasticSearch uses for API access, which includes everything from queries and filters to creating new indexes.
The ElasticSearch distribution model works by posting a JSON document with the new index definition. ElasticSearch automatically balances the shards across available nodes, and it sounded like this is done in a node-aware way, so that replicas are stored on different nodes if possible. At index or query time, you can hit any node, and the node itself is responsible for directing the operation to the right place(s).
ElasticSearch supports per-document consistency (in other words, no commit/flush support.) The most interesting thing, to me, is that ElasticSearch embraces the idea of transient storage. In other words it was written with the intent of running on something like EC2, where you can’t trust local storage, and writing to remote storage is expensive, both computationally and monetarily.
To get around this potential bottleneck, ElasticSearch still supports reliable (or “somewhat reliable”, if you want to split hairs) persistence by assuming that not all replicas containing a node will fail at the same time. New documents are lazily logged to the backing store, and if the master node doing the logging dies, then one of its slaves will detect the death and finish the logging. When a new node starts up, it reads from the backing store.
The other interesting thing is that ElasticSearch is aware of different cloud providers (it sounded like EC2 and Rackspace Cloud) and consult the cloud provider itself to get a list of potential nodes, allowing it to self-assemble.
Key differences between ElasticSearch and Solr? Distributed model, different API, no facet support yet.
Nutch as a Web Mining Platform – Andrzej Bialecki
Andrzej Bialecki (SIGRAM) described Nutch, a distributed web-crawler/search engine built on top of Lucene. It provides the standard framework of a distributed search engine that is extensible by plug-ins, and handles things such as URL filtering, normalizing, depth vs. breadth first crawling, etc.
The presentation was eye-opening just to see all of the things that make web-crawling difficult that are NOT about storing data and serving up queries. The Nutch team has spent a large amount of time going into things like what happens when you encounter auto-generated sites, buggy web-servers, link-spammers, or other tar-pits during a crawl.
Some common techniques for “bootstrapping” a web crawler are to start with some high quality seed sites, which may be well-known, authoritative resources, reference sites, or even the top-N results from an existing search engine like Google.
Once you have your search data, Andrzej described ways to mine the data, such as using keyword, phrase, or anchor search, using facets to find latent topics, using top-N results to prioritize future crawling, mining incoming links, treating the web as a corpus of sample textual data, associating concepts, uncoving gossip, opinions, and sentiments from data.
Nutch is currently under a redesign, attempting to share more code with common crawler libraries and other projects. Part of this will be converting data storage to a well-defined layer, allowing for pluggable backends so that users can take advantage of native data tools for those backends.
Riak Search – Rusty Klophaus
In my second talk, I discussed Riak Search, a distributed indexing and full-text search engine built on (and complementary to) Riak.
Part one covered the main reason for building Riak search: clients have built applications that eventually need to find data by value, not just by key. This is difficult, if not impossible, in a key/value store.
Part two described the shape of the final solution we set out to create. The goal of Riak Search is to support the Lucene interface, with Lucene syntax support and Solr endpoints, but with the operations story of Riak. This means that Riak Search will scale easily by adding new machines, and will continue to run after machine failure.
Part three was an introduction to Inverted Indexing, which is the heart of all search systems, as well as the difference between Document-Partitioning and Term-Partitioning, which forms the ongoing battle in the distributed search field.
The tradeoffs are that Document-Partitioning generally has lower latency queries, but lower overall throughput due to it requiring a disk-seek on each partition. For this reason, Riak Search uses Term-Based partitioning, with some special optimizations using term-splitting, bloom filters, and result batching.
Slides available soon.
Talks I Wished I Had Attended
The conference schedule today had two tracks, Search and NoSQL, plus I presented two talks, so there were a number of talks I was not able to attend. I would have liked to see the talks below, and look forward to the conference video:
- Finite-State Queries in Lucene – Robert Muir
- Text and metadata extraction with Apache Tika – Jukka Zitting
- MetaCarta GeoSearch Toolkit for Solr – James Goodwin
- The return of the Hierarchical Model – Jukka Zitting
- Five cool problems you can solve with Neo4J – Peter Neubauer