June 9, 2010

This was originally posted by @rklophaus on his blog, rklophaus.com.

Day 2 – June 8, 2010

Keynote – Pieter Hintjens

Pieter (iMatix) started his talk with a series of high-level questions, developer-to-developer, intended to focus the audience on the fact that multi-core processing across multiple computers is the new norm, and (most) programming tools haven’t yet evolved to meet the challenge.

He then identified and discussed some of the natural patterns in software development that make things simpler. After a few examples relating to the NoSQL world, he identified three that led into his introduction of 0MQ (Pronounced Zero-M-Q):

  • Asynchronous processing is a natural pattern, single threads that read from a queue, do work, and write to another queue.
  • Choosing reliability over persistence is a natural pattern. If you don’t crash, you don’t need to worry about persisting data. (Not sure how I feel about this one. Just gonna roll with it for the sake of the talk.)
  • Being agnostic to lines of communication is a natural pattern, what you send is orthogonal to how you send it.

Pieter then introduced the 0MQ library, which attempts to be a simple and lightweight message queue following these natural patterns. It takes care of defining queue endpoints, connecting (and re-connecting) the endpoints, buffering messages in memory, and not much else. The data format is simply a length and a binary blob, that’s all.

According to Pieter, 0MQ should be thought of as a protocol, just like TCP or UDP. In other words, 0MQ is the sort of thing you embed in your database application.

With 0MQ, you can safely create multi-threaded applications that safely leverage multiple cores by making each worker process single threaded, and have it read from a queue, perform some unit of work, and write to another queue. (In other words, the Actor model, concurrency by message passing.) The idea is not new, but it bears repeating as often as possible because it’s far simpler than multithreaded systems with locking, 99% of the time it’s the right solution, and many people still don’t know it.

Pieter had two choice quotes that drove home the main goals of 0MQ:

  • “If a developer can’t pick it up in a weekend, it’s not going to work.”
  • “Cheap effective messaging changes the way we think about architecting applications.”

Hypertable – Doug Judd

I unfortunately missed the first few minutes of Doug’s talk. When I arrived, Doug (Hypertable) was in the midst of an architectural overview of the Google stack, BigTable architecture, the ideas behind a log-structured merge tree, and examples of Hypertable optimizations, including bloom filters and using different compression algorithms in different parts of the system.

The money slides came toward the end, with performance comparisions claiming Hypertable to be 70% faster than HBase on random reads and sequential writes. Another chart claimed Hypertable to be multiple times faster than HBase when doing only random reads of different distributions.

Apart from the previously mentioned optimizations, there seem to be two main reasons for Hypertable’s speed: It’s based in C vs. HBase’s Java, and it is smart enough to dynamically adjust memory between caching reads and buffering writes according to the read/write distribution of the data.

And yes, Hypertable works with Hadoop for Map/Reduce-ing goodness…

Apache Cassandra Revisited – Eric Evans

Eric quickly focusing the talk by narrowing from All-of-NoSQL to Just-the-Large-Data-Projects, and then finally Just-the-BigTable-or-Dynamo-projects, which means Cassandra, HBase, Hypertable, Riak, and Voldemort.

Given these criteria, Eric called Cassandra the love-child of BigTable AND Dynamo, having influences from both. As such it has Dynamo staples like homogonous nodes, P2P-routing and partitioning (though not VNodes), and things like SSTables and (optionally) ordered data and range queries, similar to BigTable. (His slides contained a humorous, yet distrurbing picture showing a Brad Pitt/Angelina Jolie mutant child.)

Eric described the bootstrap process, the Cassandra data model (Keyspace, Column Family, Record, Column), and the interface (Thrift), and showed API examples.

He then highlighted a few key Cassandra developments and features:

  • Cassandra does now support batch Map/Reduce via Hadoop.
  • Cassandra comes with rack awareness, and this can be customized.
  • Keyspaces and Column families, currently defined in XML, will soon be configurable via an API without a restart.
  • Vector clocks will be added in the future. (But in his view, the hype outweights the benefit.)
  • SuperColumns may be phased out in the future, as they are not widely used, and lead to more confusion than they are worth.

According to Eric, the largest Cassandra instance that he knows of is Twitter, with around 100 nodes holding about 170TB of data.

Massively Parallel Analytics Beyond Map/Reduce – Fabian Huske

Fabian (TU Berlin) began by describing some of the challenges behind Map/Reduce, namely that it does make big data processing more simple than it used to be, but it still requires a developer to fit his problem into something Map/Reduce shaped, and this is exacerbated by the complexities of the various Map/Reduce frameworks out there.

Fabian then introduced Stratosphere, which is a combination programming model (PACT) and execution engine (Nephele) that provides additional blocks beyond Map and Reduce that can be used instead of a simple Map or Reduce, with the dual goals of making it easier to program as well as require fewer execution phases leading to higher performance. Stratosphere is a result of combining Map/Reduce with parallel database technology.

As an example, Fabian showed a SQL task that could be converted to two Map/Reduce jobs that with Stratosphere could be made simpler using Stratosphere.

A few examples: with PACT, you have new second-order functions in which to put your user code such as operations for “cross” (compute a cartesian cross-product of inputs), “match” (compute only where input keys from both sources match), and “cogroup” (missed this one). Building more complex second-order functions allows for less user code.

Next steps for the projcet are more input contracts, flexible checkpointing and recovery, and robust and adaptive execution, with a goal of going open-source by the end of 2010.

Sqoop – Database Import and Export for Hadoop – Aaron Kimball

Aaron (Cloudera) set the stage with a quick run-down of the limits of the SQL world, and the plusses and minuses of Hadoop, which lead to the introduction of Sqoop (SQL in Hadoop).

Sqoop provides a suite of tools to connect Hadoop to a JDBC-compliant SQL database, extract data and schema information, import the data into Hadoop, auto-generate code to parse the data, and export any results back into the SQL database.

The goal is to make it easier to pull SQL-hosted data into your Hadoop-cluster for the purpose of having the data available while doing other processing. For example, clickstream data might be in Hadoop, while profile information is in SQL. With Sqoop, you can get the data into Hadoop in an efficient way to support analysis. Copying the data from SQL in one operation is better than repeatedly hitting the database while running analysis because a big Hadoop cluster can easily hose a SQL machine.

Sqoop has some complexity under the hood:

  • It can export data definitions to Hive (see writeup next.)
  • It reads from/write to SQL in parellel.
  • You can use a SELECT/WHERE query to get data, which allows you to run Sqoop incrementally, fetching new data since the last run.
  • Supports mysqldump and mysqlexport.


Hive: SQL for Hadoop – Sarah Sproehnle

Sarah (Cloudera) described Hive, a parser, optimizer, compiler, and shell for transforming SQL-like queries into Map/Reduce. With Hive, you think of your data as being in tables rather than files, so you create tables, load data from a local file or Hadoop file into the table, and can then run SQL-like queries.

(I used the word “SQL-like” above, but Hive queries are actually standards compliant SQL, with just a few limitations/twists. Anyone who knows SQL at any level can pick up the changes in just a few minutes.)

In other words, with Hive you can:

  • CREATE and DESCRIBE tables, and ADD and DROP columns on tables.
  • EXPLAIN queries.
  • Query Hadoop data using SELECT, TOP, FROM, JOIN, WHERE, GROUP BY, and ORDER BY.
  • Write Hadoop data using INSERT (though the insert actually means “clobber the old data and replace it with this new data”)
  • In a twist on standard SQL, run a multi-table INSERT, where you split a single SELECT/FROM into multiple output streams, allowing you to write different columns to different tables. You can also, further filter, group, or transform the data in each stream independently before writing to the final table.
  • Run data through a custom shell script, expecting lines of data on stdin, results on stdout.
  • Partition and bucket data, allowing for easy ways to drop a subset of data, or take a sampling of data.

Hive gives you the convenience of SQL, but at the end of the day it’s still running as a Map/Reduce job on Hadoop, which means:

  • No transactions.
  • Latencies measured in minutes, not milliseconds.
  • No indexes, think of everything as a full-table scan.

Not surprising, and not bad considering you can run a SQL query across Petabytes of data.

The Hive install is installed on the client, so you don’t need to do anything to the Hadoop cluster to run it. Hive keeps schema information in a Metastore, which can be kept on the local machine without any special configuration, or shared in a central repository allowing multiple users to share Hive table definitions. The schema is verified at data read time, not when the schema is created. Again, this makes sense given Hadoop’s execution model.

1,000 points to Sarah for running a live demo during the presentation. Gutsy, but always a crowd pleaser.

Talks I Wished I Had Attended

The conference schedule today had two tracks, so there were a number of talks I was not able to attend. I would have liked to see the talks below, and look forward to the conference video:

  • Hadoop: An Industry Perspective – Aaron Kimball
  • Behemith: A Hadoop-based platform for large scale document processing – Julian Nioche
  • Introduction to Collaborative Filtering using Mahout – Frank Scholten

Closing Session

Isabel Drost, Jan Lehnardt, and Simon Willnauer kept the wrap-up short, thanking the other organizers, the tech staff (who gave a quick, fun recap of network usage), the venue, the presenters, and the audience.

When Jan asked who wanted to go to BerlinBuzzwords 2011 next year, every hand in the room shot up.

Final Thoughts

BerlinBuzzwords was an amazing conference. Half of the credit goes to the organizers for picking a great venue and interesting presenters. The other half goes to the largely German/European audience, who, 99% of the time, were focused on the presentation with laptops closed and (often) paper notepads open. This level of engagement lead to great questions from the audience after each presentation, and lots of hallway interaction. Sign me up for next year!