Tag Archives: conference

Basho is Speaking at Google I/O

May 15, 2013

Google I/O, Google’s popular developer conference, is happening in San Francisco now through Friday, May 17. This conference features speakers from various industries as well as code labs and developer demos.

We are excited to announce that Basho’s own Tyler Hannan will be speaking on a panel entitled, “Distributed Databases Panel: An Exploration of Approaches and Best Practices.” This panel will take place on Thursday, May 16th at 12:45pm as part of the Google Cloud Platform track.

Joining Tyler on this panel will be:

  • Julia Ferraioli, Developer Advocate for Google Compute Engine
  • Brian Dorsey, Developer Programs Engineer for the Google Developer Relations Team
  • Chris Ramsdale, Product Manager for Google Cloud Platform
  • Mike Miller, Chief Scientist at Cloudant
  • Will Shulman, CEO and Co-Founder of MongoLab

If you’re at Google I/O this week, we hope to see you at our panel! For a full list of where Basho will be, check out our Events Page.

Basho

RICON East Livestream Available

May 13, 2013

Today is the kick off of our distributed systems conference, RICON East! This two-day conference is taking over New World Stages and features a great line up of speakers ranging from academia to industry, talking about the theory, practice and importance of running distributed systems in production.

The opening keynote will be given by Dr. Margo Seltzer, whose talk, Automatically Scalable Computation, discusses the current limitations of distributed computation and, based off research being conducted at Harvard, offers one potential solution. You can also check out the full schedule of speakers here.

If you weren’t able to get tickets this year, don’t worry. You can watch all of the great talks via our live stream at ricon.io/live.html.

Also, if you’re going to be in San Francisco on October 29-30, there will be some exciting updates about RICON West coming soon.

See you at RICON!

Basho

RICON East Speakers Announced

April 11, 2013

On May 13-14, RICON East will take place in New York City – with tickets still available here. RICON is Basho’s series of distributed system conferences for developers. We first launched RICON last October at the sold out San Francisco show. This year, we have three conferences scheduled across the globe, with the first in New York.

RICON East will bring together developers, engineers, architects, and scientists to discuss Riak, as well as key emerging research areas and approaches to solving the challenges faced by the industry today.

Earlier this week, the confirmed speaker line-up was released and can be found here. Here’s a look at some of the speakers:

  • Dr. Margo L. Seltzer, Professor at Harvard University
  • Rich Hickey, Creator of Clojure, Datomic
  • Camille Fournier, VP of Architecture at Rent the Runway
  • Hilary Mason, Chief Scientist at bitly
  • Theo Schlossnagle, Founder and CEO at OmniTI
  • Shawn Gravelle, IT Architect at State Farm Insurance
  • Ed Laczynski, VP of Cloud Strategy and Architecture at Datapipe
  • Brian Akins, Chief Operations Engineer at Turner Broadcasting System
  • Sathish Gaddipati, VP of Enterprise Data at The Weather Channel
  • Michajlo Matijkiw, Senior Software Engineer at Comcast

Many Basho engineers will also be speaking throughout the conference, including: Andy Gross, Sean Cribbs, Matthew Von-Maszewski, Ryan Zezeski, and Chris Tilt.

If you still haven’t purchased your tickets, there are still some available here! Also check out some of last year’s amazing talks or reach out to Mark Phillips if you’re interested in group ticket discounts or sponsorships opportunities

See you in New York!

Basho

Basho Announces Initial Speaker Line-Up for RICON East

Over 30 speakers from bitly, Comcast, The Weather Channel, State Farm Insurance, Turner Broadcasting System, Harvard University, and more to discuss the future of distributed systems.

New York City, NY – April 8, 2013Basho, the worldwide leader in distributed database and cloud storage software, announced today the initial speaker line up for RICON East. RICON is Basho’s global conference series that is dedicated to distributed systems and is designed by and for engineers, developers, data scientists, and architects. RICON East is being held May 13-14 in New York City, NY. Basho expects to assemble hundreds of the industry’s most influential thinkers and practitioners devoted to deploying distributed systems technologies, including NoSQL solutions and Cloud Storage.

Speakers include:
Dr. Margo L. Seltzer, Harvard University
Rich Hickey, Creator of Clojure, Datomic
Camille Fournier, Rent the Runway
Alex Payne, Breather
Hilary Mason, bitly
Theo Schlossnagle, OmniTI
Robert Treat, OmniTI
Neha Narula, Massachusetts Institute of Technology (MIT)
Neil Conway, UC Berkeley
Kyle Kingsbury, Factual
Shawn Gravelle, State Farm Insurance
Sam Townsend, State Farm Insurance
Ed Laczynski, Datapipe
Brian Akins, Turner Broadcasting System
Sathish Gaddipati, The Weather Channel
Michajlo Matijkiw, Comcast
Mark Wunsch, Gilt Groupe

Basho engineers will be featured prominently throughout RICON East. Basho speakers include: Andy Gross, Sean Cribbs, Matthew Von-Maszewski, Ryan Zezeski, Chris Tilt.

RICON East builds on Basho’s highly successful, sold-out RICON 2012 event held Fall 2012 in San Francisco. Presentations from RICON 2012 are available to view at www.ricon2012.com.

Ticket Information
Tickets are available online at http://ricon.io/east.html. Student discount prices are available online. For other discounts, including discounts for large groups, contact Mark Phillips at mark@basho.com.

Sponsorship information
Initial sponsors of RICON East include Fastly, Meraki, Engine Yard, Github and NoSQLWeekly. For more information on sponsorship opportunities, contact Tom Santero at tsantero@basho.com.

About Basho Technologies
Basho is a distributed systems company dedicated to making software that is highly available, fault-tolerant and easy-to-operate at scale. Basho’s distributed NoSQL database, Riak, and Basho’s cloud storage software, Riak CS, are used by fast growing Web businesses and by over 25% of the Fortune 50 to power their critical Web, mobile and social applications and their public and private cloud platforms.

Riak and Riak CS are available open source. Riak Enterprise and Riak CS Enterprise offer enhanced multi-datacenter replication and 24×7 Basho support. For more information, visit basho.com.

Basho is headquartered in Cambridge, Massachusetts and has offices in London, San Francisco, Tokyo and Washington DC.

Rovio and Riak at GDC This Week

March 26, 2013

Basho is a proud sponsor of the Game Developers Conference (GDC), happening this week in San Francisco. GDC is the world’s largest and longest running game industry event.

On Wednesday, March 27 at 11:00am, Basho Chief Architect, Andy Gross, and Rovio Entertainment Product Manager, Timo Herttua, will be co-presenting in room 3022 of the West Hall. The talk will discuss how Rovio uses Riak, the data storage requirements for today’s high-scale and high-performance gaming platforms, and how to run distributed systems in production.

Angry Birds developer Rovio is using Riak as the database supporting its new mobile gaming platform, including features such as payments, game state storage, and push notifications. The Croods game, launched last week, was the first to use this new platform. The Croods game is now available on Android and iOS.

If you’re at GDC this week, make sure to stop by our booth (#202), get a t-shirt, and learn how top gaming companies are using Riak.

If you’re not at GDC, but would like to learn more about Riak and how it can be used for gaming platforms, download “Gaming on Riak: A Technical Introduction,” or visit our gaming page.

Basho

Update on RICON | EAST: CFP, Early Bird, and Talk Announcements

Last year, Basho held a widely-acclaimed conference, RICON2012, where leading technologists gave insightful talks and shared ideas about Basho’s distributed database Riak and, more broadly, the distributed systems space.

The conference will once again host developers, engineers, architects, and scientists talking about Riak as well as key emerging research areas and approaches to solving the challenges faced by the industry today. Learn how some of the smartest people in the world are solving some of the hardest problems in the world.

Early bird ticket sales have begun and talk proposals are welcomed at ricon@basho.com. Please note that the deadline for CFPs is March 15th.

Watch the official RICON blog for speaker announcements.

To get a better idea of what RICON is all about, recorded talks from RICON2012 can be found on the RICON website or Vimeo. Expect to be inspired and receive a fashionable hoodie — with your Twitter/GitHub handle along the side.

RICON Hoodie

Basho’s European Conference Schedule

May 22, 2012

A handful of the Basho team are descending on Europe, attending and speaking at various conferences and meetups, and we couldn’t be more excited to meet and mingle with the growing European Riak community.

Here’s the full list of places and events where we’ll be. If you will be at any of these events and want to talk Riak, we want to hear from you! Send us a tweet or email Tom at tsantero@basho.com

GOTO Copenhagen

May 21-23

GOTO Copenhagen began yesterday and runs through May 23. The GOTO conference series are international events put on for, and by, software developers. This year’s theme is “Real Stories from Real People” and attendees can expect to learn about solving real life problems form real life experiences from a number of leading experts and authors.

Put these talks in your calendar:

Basho will also have a booth on the exhibition floor. Be sure to stop by in between talks to chat with Ian or Tom about Riak, distributed systems or your favorite fancy cocktail.

GOTO Amsterdam

May 24-25

GOTO is on a roll this year, with two European conferences scheduled back to back. GOTO Amsterdam is hosted at Beurs van Berlage. For a two day conferences, the list of speakers is quite impressive so kudos to the GOTO Program Advisory Board for putting this one together.

In addition to a booth on the exhibition floor, Andy Gross will deliver the following talk that cannot be missed:

Be sure to corner [Andy](https://twitter.com/#!/argv0) or [Tom](https://twitter.com/#!/tsantero) in between sessions and ask them hard questions about Riak.

### NoSQL Matters

**May 29-30**

Basho’s [Tom Santero](https://twitter.com/#!/tsantero) will be attending [NoSQL Matters](http://www.nosql-matters.org/) set to take place in Cologne, Germany. This is a brand new conference, and we have very high expectations for success considering the caliber of [speakers](http://www.nosql-matters.org/speakers/) on deck.

If you’re in attendance, be sure not to miss these talks from members of the Riak community:

* [Designing for Concurrency with Riak](http://www.nosql-matters.org/agenda/) – Mathias Meyer
* [Theoretical Aspects of Distributed Systems, Playfully Illustrated](http://www.nosql-matters.org/agenda/) – Pavlo Baron

### Erlang User Conference

**May 28 – June 1**

Stockholm plays host to this year’s [Erlang User Conference](http://www.erlang-factory.com/conference/ErlangUserConference2012). The events put on by Erlang Solutions are usually exceptional, and Basho will be well represented this year.

The conference itself last for two days, Monday and Tuesday, followed by a day of tutorials on Wednesday and then wrapped up with two days of workshops on Thursday and Friday.

We’ll be delivering the following talks:

* [Sweden's Next Top NoSQL Data Model](http://www.erlang-factory.com/conference/ErlangUserConference2012/speakers/IanPlosker) – Ian Plosker
* [Innovation: What Every Developer Absolutely Needs To Know](http://www.erlang-factory.com/conference/ErlangUserConference2012/speakers/SteveVinoski) – Steve Vinoski

Basho’s VP of Engineering, [Dizzy Smith](https://twitter.com/#!/dizzyd), will host a [tutorial](http://www.erlang-factory.com/conference/ErlangUserConference2012/speakers/DizzySmith) demonstrating [Rebar](https://github.com/basho/rebar), an open-source build-system for Erlang/OTP applications.

[Ian Plosker](https://twitter.com/#!/dstroyallmodels) will be running a two day class on
[Building distributed clusters with Riak](http://www.erlang-factory.com/conference/ErlangUserConference2012/university/RiakTraining). Everyone who attends will walk away with a very clear understanding of just why Riak is the best distributed database you will ever run in production.

### London Riak Meetup

**May 30**

Basho is pleased to announce that Ian Plosker will be hosting the [Inaugural London Riak Meetup](http://www.meetup.com/riak-london/events/62061262/).

This first meetup in London will feature a talk by Basho’s VP of Engineering, [Dizzy Smith](https://twitter.com/#!/dizzyd) and is hosted in Google’s new [co-working space](http://www.campuslondon.com/).

If you’re in or around London on May 30, missing this is not optional.

*Don’t forget to [RSVP](http://www.meetup.com/riak-london/events/62061262/).*

### EuRuKo

**June 1-2**

[EuRuKo](http://www.euruko2012.org/) is an annual Ruby conference hosted in Amsterdam. All attendees can expect a killer [venue](http://www.euruko2012.org/#venue), awesome [lineup of speakers](http://www.euruko2012.org/#speakers) and a fancy Boat Party sponsored by GitHub.

[Sean Cribbs](https://twitter.com/#!/seancribbs) will give a talk titled *A Case of Accidental Concurrency* – if you haven’t been lucky enough to hear Sean speak in person before, than you’re in for a real treat.

### Berlin Buzzwords

**June 4-5**

Last, but certainly not least, we’ll be at [Berlin Buzzwords](http://berlinbuzzwords.de/) for two days of [brilliant technologists](http://berlinbuzzwords.de/speakers), [hackathons](http://berlinbuzzwords.de/wiki/hackathons) and training.
The theme of this conference is “search”, “store” and “scale”, our natural habitat so to speak.

You’ll get to hear form the Basho team:

* [Germany's Next Top Data Model](http://berlinbuzzwords.de/sessions/germanys-next-top-data-model) – Ian Plosker
* [Eventually Consistent Data Structures](http://berlinbuzzwords.de/sessions/eventually-consistent-data-structures) – Sean Cribbs

As well as the Riak community:

* [From Hand to Mouth](http://berlinbuzzwords.de/sessions/hand-mouth) – Pavlo Baron

[Tom](https://twitter.com/#!/tsantero)

Berlin Buzzwords Day Two Recap

June 9, 2010

This was originally posted by @rklophaus on his blog, rklophaus.com.

Day 2 – June 8, 2010

Keynote – Pieter Hintjens

Pieter (iMatix) started his talk with a series of high-level questions, developer-to-developer, intended to focus the audience on the fact that multi-core processing across multiple computers is the new norm, and (most) programming tools haven’t yet evolved to meet the challenge.

He then identified and discussed some of the natural patterns in software development that make things simpler. After a few examples relating to the NoSQL world, he identified three that led into his introduction of 0MQ (Pronounced Zero-M-Q):

  • Asynchronous processing is a natural pattern, single threads that read from a queue, do work, and write to another queue.
  • Choosing reliability over persistence is a natural pattern. If you don’t crash, you don’t need to worry about persisting data. (Not sure how I feel about this one. Just gonna roll with it for the sake of the talk.)
  • Being agnostic to lines of communication is a natural pattern, what you send is orthogonal to how you send it.

Pieter then introduced the 0MQ library, which attempts to be a simple and lightweight message queue following these natural patterns. It takes care of defining queue endpoints, connecting (and re-connecting) the endpoints, buffering messages in memory, and not much else. The data format is simply a length and a binary blob, that’s all.

According to Pieter, 0MQ should be thought of as a protocol, just like TCP or UDP. In other words, 0MQ is the sort of thing you embed in your database application.

With 0MQ, you can safely create multi-threaded applications that safely leverage multiple cores by making each worker process single threaded, and have it read from a queue, perform some unit of work, and write to another queue. (In other words, the Actor model, concurrency by message passing.) The idea is not new, but it bears repeating as often as possible because it’s far simpler than multithreaded systems with locking, 99% of the time it’s the right solution, and many people still don’t know it.

Pieter had two choice quotes that drove home the main goals of 0MQ:

  • “If a developer can’t pick it up in a weekend, it’s not going to work.”
  • “Cheap effective messaging changes the way we think about architecting applications.”

Hypertable – Doug Judd

I unfortunately missed the first few minutes of Doug’s talk. When I arrived, Doug (Hypertable) was in the midst of an architectural overview of the Google stack, BigTable architecture, the ideas behind a log-structured merge tree, and examples of Hypertable optimizations, including bloom filters and using different compression algorithms in different parts of the system.

The money slides came toward the end, with performance comparisions claiming Hypertable to be 70% faster than HBase on random reads and sequential writes. Another chart claimed Hypertable to be multiple times faster than HBase when doing only random reads of different distributions.

Apart from the previously mentioned optimizations, there seem to be two main reasons for Hypertable’s speed: It’s based in C vs. HBase’s Java, and it is smart enough to dynamically adjust memory between caching reads and buffering writes according to the read/write distribution of the data.

And yes, Hypertable works with Hadoop for Map/Reduce-ing goodness…

Apache Cassandra Revisited – Eric Evans

Eric quickly focusing the talk by narrowing from All-of-NoSQL to Just-the-Large-Data-Projects, and then finally Just-the-BigTable-or-Dynamo-projects, which means Cassandra, HBase, Hypertable, Riak, and Voldemort.

Given these criteria, Eric called Cassandra the love-child of BigTable AND Dynamo, having influences from both. As such it has Dynamo staples like homogonous nodes, P2P-routing and partitioning (though not VNodes), and things like SSTables and (optionally) ordered data and range queries, similar to BigTable. (His slides contained a humorous, yet distrurbing picture showing a Brad Pitt/Angelina Jolie mutant child.)

Eric described the bootstrap process, the Cassandra data model (Keyspace, Column Family, Record, Column), and the interface (Thrift), and showed API examples.

He then highlighted a few key Cassandra developments and features:

  • Cassandra does now support batch Map/Reduce via Hadoop.
  • Cassandra comes with rack awareness, and this can be customized.
  • Keyspaces and Column families, currently defined in XML, will soon be configurable via an API without a restart.
  • Vector clocks will be added in the future. (But in his view, the hype outweights the benefit.)
  • SuperColumns may be phased out in the future, as they are not widely used, and lead to more confusion than they are worth.

According to Eric, the largest Cassandra instance that he knows of is Twitter, with around 100 nodes holding about 170TB of data.

Massively Parallel Analytics Beyond Map/Reduce – Fabian Huske

Fabian (TU Berlin) began by describing some of the challenges behind Map/Reduce, namely that it does make big data processing more simple than it used to be, but it still requires a developer to fit his problem into something Map/Reduce shaped, and this is exacerbated by the complexities of the various Map/Reduce frameworks out there.

Fabian then introduced Stratosphere, which is a combination programming model (PACT) and execution engine (Nephele) that provides additional blocks beyond Map and Reduce that can be used instead of a simple Map or Reduce, with the dual goals of making it easier to program as well as require fewer execution phases leading to higher performance. Stratosphere is a result of combining Map/Reduce with parallel database technology.

As an example, Fabian showed a SQL task that could be converted to two Map/Reduce jobs that with Stratosphere could be made simpler using Stratosphere.

A few examples: with PACT, you have new second-order functions in which to put your user code such as operations for “cross” (compute a cartesian cross-product of inputs), “match” (compute only where input keys from both sources match), and “cogroup” (missed this one). Building more complex second-order functions allows for less user code.

Next steps for the projcet are more input contracts, flexible checkpointing and recovery, and robust and adaptive execution, with a goal of going open-source by the end of 2010.

Sqoop – Database Import and Export for Hadoop – Aaron Kimball

Aaron (Cloudera) set the stage with a quick run-down of the limits of the SQL world, and the plusses and minuses of Hadoop, which lead to the introduction of Sqoop (SQL in Hadoop).

Sqoop provides a suite of tools to connect Hadoop to a JDBC-compliant SQL database, extract data and schema information, import the data into Hadoop, auto-generate code to parse the data, and export any results back into the SQL database.

The goal is to make it easier to pull SQL-hosted data into your Hadoop-cluster for the purpose of having the data available while doing other processing. For example, clickstream data might be in Hadoop, while profile information is in SQL. With Sqoop, you can get the data into Hadoop in an efficient way to support analysis. Copying the data from SQL in one operation is better than repeatedly hitting the database while running analysis because a big Hadoop cluster can easily hose a SQL machine.

Sqoop has some complexity under the hood:

  • It can export data definitions to Hive (see writeup next.)
  • It reads from/write to SQL in parellel.
  • You can use a SELECT/WHERE query to get data, which allows you to run Sqoop incrementally, fetching new data since the last run.
  • Supports mysqldump and mysqlexport.

 

Hive: SQL for Hadoop – Sarah Sproehnle

Sarah (Cloudera) described Hive, a parser, optimizer, compiler, and shell for transforming SQL-like queries into Map/Reduce. With Hive, you think of your data as being in tables rather than files, so you create tables, load data from a local file or Hadoop file into the table, and can then run SQL-like queries.

(I used the word “SQL-like” above, but Hive queries are actually standards compliant SQL, with just a few limitations/twists. Anyone who knows SQL at any level can pick up the changes in just a few minutes.)

In other words, with Hive you can:

  • CREATE and DESCRIBE tables, and ADD and DROP columns on tables.
  • EXPLAIN queries.
  • Query Hadoop data using SELECT, TOP, FROM, JOIN, WHERE, GROUP BY, and ORDER BY.
  • Write Hadoop data using INSERT (though the insert actually means “clobber the old data and replace it with this new data”)
  • In a twist on standard SQL, run a multi-table INSERT, where you split a single SELECT/FROM into multiple output streams, allowing you to write different columns to different tables. You can also, further filter, group, or transform the data in each stream independently before writing to the final table.
  • Run data through a custom shell script, expecting lines of data on stdin, results on stdout.
  • Partition and bucket data, allowing for easy ways to drop a subset of data, or take a sampling of data.

Hive gives you the convenience of SQL, but at the end of the day it’s still running as a Map/Reduce job on Hadoop, which means:

  • No transactions.
  • Latencies measured in minutes, not milliseconds.
  • No indexes, think of everything as a full-table scan.

Not surprising, and not bad considering you can run a SQL query across Petabytes of data.

The Hive install is installed on the client, so you don’t need to do anything to the Hadoop cluster to run it. Hive keeps schema information in a Metastore, which can be kept on the local machine without any special configuration, or shared in a central repository allowing multiple users to share Hive table definitions. The schema is verified at data read time, not when the schema is created. Again, this makes sense given Hadoop’s execution model.

1,000 points to Sarah for running a live demo during the presentation. Gutsy, but always a crowd pleaser.

Talks I Wished I Had Attended

The conference schedule today had two tracks, so there were a number of talks I was not able to attend. I would have liked to see the talks below, and look forward to the conference video:

  • Hadoop: An Industry Perspective – Aaron Kimball
  • Behemith: A Hadoop-based platform for large scale document processing – Julian Nioche
  • Introduction to Collaborative Filtering using Mahout – Frank Scholten

Closing Session

Isabel Drost, Jan Lehnardt, and Simon Willnauer kept the wrap-up short, thanking the other organizers, the tech staff (who gave a quick, fun recap of network usage), the venue, the presenters, and the audience.

When Jan asked who wanted to go to BerlinBuzzwords 2011 next year, every hand in the room shot up.

Final Thoughts

BerlinBuzzwords was an amazing conference. Half of the credit goes to the organizers for picking a great venue and interesting presenters. The other half goes to the largely German/European audience, who, 99% of the time, were focused on the presentation with laptops closed and (often) paper notepads open. This level of engagement lead to great questions from the audience after each presentation, and lots of hallway interaction. Sign me up for next year!

Berlin Buzzwords Day One Recap

June 8, 2010

This was originally posted by @rklophaus on his blog, rklophaus.comBerlinBuzzwords has a stellar venue and talks describing cutting edge developments on all things search, scalability, and storage. My recap of Day 1 is below.

Check back for a writeup on part 2.

Day 1 – June 7, 2010

The conference check-in was seamless, with much swag including messenger bags and notebooks. The venue itself–Kosmos Club–was amazing. Kosmos Club was the biggest movie house in East Germany before the fall of the wall, and has since turned into an event venue. Lots of metallics, varied textures, and swank chandeliers, with two of the biggest presentation rooms I’ve ever seen (they used to be movie theatres.)

Isabel Drost, Jan Lehnardt, and Simon Willnauer kept the opening remarks light and intimate. Overall, there were about 350 people in attendance.

Keynote: Grant Ingersoll from Lucid Imagination

Grant focused his talk around the words “Open”, “Scalable”, and “Intelligent”. He described how a number of things, such as big-data storage, search, and distributed computing have become commodities, but required a staff of Ph.D. level employees just a few years ago.

The main point of his talk is that openness and open-source, plus scalability, have turned these things into commodities, and that the next big, interesting thing to work on is data intelligence.

  • Data (produced per year, I believe) has grown from 161 exabytes in 2006 to ~1000 exabytes in 2010.
  • 85% of data we produce is unstructured, where unstructured may mean that we just aren’t yet smart enough to parse the data.

There are multiple levels to intelligence:

  • Level 1 – Finding, organizing, discovering and associating data
  • Level 2 – Collecting and personalizing data
  • Level 3 – Mining data for sentiment and semantics
  • Level 4 – Learning from data, extracting ideas
  • Level 5 – Reasoning about data

At this point in the talk, he switched to a description of Apache Mahout, a machine-learning engine. Mahout can do things such as recommendations, collaborative filtering, Bayesian analysis, Random Forests, discovery, and pattern matching. At some point, he mentioned things like Restricted Boltzmann Machines, Stachastic Gradient Dsecents, and Vector Machines as upcoming features.

The takeaway is that using Mahout, you can build open, scalable, intelligent apps right now. In practical terms, this means things like auto-suggest, auto-complete, content clustering, clickstream analysis, etc.

NoSQL: The Definitive Guide – Mathias Mayer

Mathias Meyer (Peritor Solutions, @roidrage) gave a very balanced view of the current state of NoSQL. Mathias gave history its proper respect, saying that much of what we view as “new” in NoSQL is actually older ideas in a prettier package. He mentioned things like Berkely DB (K/V store), Sybase (Column Store), Versant Object DB (Graph/Object Database), and Lotus Notes (Peer to Peer Document Database) as throwback examples.

Mathias said that relational databases tried to be a one-size-fits all solution, and that NoSQL is about removing constraints to speed up performance and development.

Mathias himself is a fan of CouchDB, Redis, and Riak, and wisely avoided giving specific recommendations on what projects someone should use. Each datastore handles different use cases, so use what is right for your data.

Mathias briefly touched on Voldemort, Tokyo, Redis, S3, Scalaris, Couhdb, Riak, Mongo, BigTable, Cassandra, HBase, HyperTable, Core Data, Neo4J, and HyperGraphDB. (In order of mention.) He then gave a slightly more in depth view of the replication and scaling models of both CouchDB and Dynamo.

One of the key things that NoSQL gets right, he said, is being constructed of open web technologies such as JSON, HTTP, links, and textual protocals.

What is hard for NoSQL now? Range queries, ad-hoc queries, and transactions, mainly because the NoSQL space is focused on scalability as a major goal.

One of the big points that I’m glad he brought up is, “As a developer, how do I know I’m not wasting my time on a NoSQL solution?” The key is that each of the different NoSQL projects was built to solve a real-world problem, so trust that somebody found it useful and needed it built.

His main point: NoSQL is not the Holy Grail. NoSQL should not be about replacing SQL. Instead, you need to be okay with having polyglot data storage. The data itself should dictate the datastore.

Making Software for Humans – CouchDB and the Usable Peer-To-Peer Web – Jan Lehnhardt

Jan Lehnardt (@janl, Couch.io), after a brief introduction of himself and Couch, led off with the statement that 80% of all NoSQL projects do the same thing as flat files. It’s the differences in the last 20% that really differentiate the projects. Therefore, NoSQL is about choice to build better systems. Each NoSQL project starts with a main idea.

According to Jan, CouchDB’s main idea is being ready to scale up, in that each node functions independently, and also ready to scale down, in that Couch is a great candidate for running on embedded mobile phones, routers, and other devices, as a way of synchronizing user content.

CouchDB’s synchronization is it’s killer feature. With Couch, you can subscribe to events like data updates that send out HTTP based notifications to other parts of your application. As an example, the Couch.io team has built a chat service based solely on writing Couch objects and receiving updates.

Jan wrapped up by touching on projects like Opera Unite as ahead-of-the-parade examples of what CouchDB is aiming to do, and is already doing, in projects like UbuntuOne–P2P data synchronization. He mentioned Facebook as an example of a centralized, closed web, Flickr as a centralized, but open service (since you can pull your data out), and Diaspora as the open web that follows the true vision of Tim Berners-Lee.

Riak from Small to Large – Rusty Klophaus

In my talk (Rusty Klophaus, Basho Technologies), I gave a brief description of how Riak differentiates itself in the NoSQL market. It was built first with the operations folks in mind, which makes sense given the Akamai background of the core developers, who understand and embrace the uncertainties of a distributed system.

I then described which features of Riak become important in single-node Riak clusters, three-node clusters, and ten-plus node clusters. Just like different features of your car are important going 50 m.p.h. vs. 100 m.p.h., different features of Riak are important at different cluster sizes.

In single node clusters, Riak provides a simple data model, with key/value access, a variety of client libraries (Python, PHP, Java, Javascript, Erlang), and configurable replication settings and backing datastores.

In small-to-medium sized clusters, Riak provides a way to take advantage of hardware in parallel, with Javascript-based Map/Reduce, well-behaved HTTP (allowing easy placement of proxies and caches), and Google Protobuffs support.

And in large-clusters, Riak provides an extremely easy operations story that can survive server outages and network partitions, and scale out by just running a few commands.

Finally, I ended with a 5-minute run through of how to use the Python client API to read/write an object, run a linkwalking operation, and run a Javascript-based map/reduce operation.

Slides are available on Slideshare.

Realtime Search with Lucene – Michel Busch

Michael Busch from Twitter discussed some upcoming changes to Lucene that allow it to search on data that has not yet been committed to disk. From my understanding, when Lucene commits a change it creates a segment and possibly merges multiple segments together. At that point, a reader can access the newly created segment.

For real-time search, the process needed to be shortened. The first attempt simply involved syncing out the changes when a reader was created. This didn’t work well. The next step was to actually search on the uncommitted index in memory. This was a challenge for a few reasons: first, Lucene uses multiple threads to update the index, so synchronizing those threads to provide the correct read-isolation is a problem. Second, the index maintains a large number of long lived objects in memory, and this causes inefficient garbage collection that kills performance.

Michael described a number of fixes that have already been written or are on their way, mostly around making multiple single-threaded index writers and changing the way postings are stored in memory which changes their structure from an unbounded number of objects to instead use a finite number of arrays. Effects on performance were amazing, especially for small memory sizes. A JVM with ~200 MB of RAM allocated was something like ~80% more performant.

A Twitter prototype with simultaneous indexes and queries showed that query performance, with the new modifications, is almost completely independent from query load, which is impressive. That said, the Twitter index has 32-bit postings (24 for DocID, 8 for term postition) so these results may not be the same for everyone.

ElasticSearch – Shay Banon

I unfortunately missed the first few minutes of Shay’s talk on ElasticSearch. ElasticSearch automates the partitioning, sharding, and replication of documents into Lucene indexes, and provides a unified interface for searches.

Shay described the JSON model that ElasticSearch uses for API access, which includes everything from queries and filters to creating new indexes.

The ElasticSearch distribution model works by posting a JSON document with the new index definition. ElasticSearch automatically balances the shards across available nodes, and it sounded like this is done in a node-aware way, so that replicas are stored on different nodes if possible. At index or query time, you can hit any node, and the node itself is responsible for directing the operation to the right place(s).

ElasticSearch supports per-document consistency (in other words, no commit/flush support.) The most interesting thing, to me, is that ElasticSearch embraces the idea of transient storage. In other words it was written with the intent of running on something like EC2, where you can’t trust local storage, and writing to remote storage is expensive, both computationally and monetarily.

To get around this potential bottleneck, ElasticSearch still supports reliable (or “somewhat reliable”, if you want to split hairs) persistence by assuming that not all replicas containing a node will fail at the same time. New documents are lazily logged to the backing store, and if the master node doing the logging dies, then one of its slaves will detect the death and finish the logging. When a new node starts up, it reads from the backing store.

The other interesting thing is that ElasticSearch is aware of different cloud providers (it sounded like EC2 and Rackspace Cloud) and consult the cloud provider itself to get a list of potential nodes, allowing it to self-assemble.

Key differences between ElasticSearch and Solr? Distributed model, different API, no facet support yet.

Nutch as a Web Mining Platform – Andrzej Bialecki

Andrzej Bialecki (SIGRAM) described Nutch, a distributed web-crawler/search engine built on top of Lucene. It provides the standard framework of a distributed search engine that is extensible by plug-ins, and handles things such as URL filtering, normalizing, depth vs. breadth first crawling, etc.

The presentation was eye-opening just to see all of the things that make web-crawling difficult that are NOT about storing data and serving up queries. The Nutch team has spent a large amount of time going into things like what happens when you encounter auto-generated sites, buggy web-servers, link-spammers, or other tar-pits during a crawl.

Some common techniques for “bootstrapping” a web crawler are to start with some high quality seed sites, which may be well-known, authoritative resources, reference sites, or even the top-N results from an existing search engine like Google.

Once you have your search data, Andrzej described ways to mine the data, such as using keyword, phrase, or anchor search, using facets to find latent topics, using top-N results to prioritize future crawling, mining incoming links, treating the web as a corpus of sample textual data, associating concepts, uncoving gossip, opinions, and sentiments from data.

Nutch is currently under a redesign, attempting to share more code with common crawler libraries and other projects. Part of this will be converting data storage to a well-defined layer, allowing for pluggable backends so that users can take advantage of native data tools for those backends.

Riak Search – Rusty Klophaus

In my second talk, I discussed Riak Search, a distributed indexing and full-text search engine built on (and complementary to) Riak.

Part one covered the main reason for building Riak search: clients have built applications that eventually need to find data by value, not just by key. This is difficult, if not impossible, in a key/value store.

Part two described the shape of the final solution we set out to create. The goal of Riak Search is to support the Lucene interface, with Lucene syntax support and Solr endpoints, but with the operations story of Riak. This means that Riak Search will scale easily by adding new machines, and will continue to run after machine failure.

Part three was an introduction to Inverted Indexing, which is the heart of all search systems, as well as the difference between Document-Partitioning and Term-Partitioning, which forms the ongoing battle in the distributed search field.

The tradeoffs are that Document-Partitioning generally has lower latency queries, but lower overall throughput due to it requiring a disk-seek on each partition. For this reason, Riak Search uses Term-Based partitioning, with some special optimizations using term-splitting, bloom filters, and result batching.

Slides available soon.

 

Talks I Wished I Had Attended

The conference schedule today had two tracks, Search and NoSQL, plus I presented two talks, so there were a number of talks I was not able to attend. I would have liked to see the talks below, and look forward to the conference video:

  • Finite-State Queries in Lucene – Robert Muir
  • Text and metadata extraction with Apache Tika – Jukka Zitting
  • MetaCarta GeoSearch Toolkit for Solr – James Goodwin
  • The return of the Hierarchical Model – Jukka Zitting
  • Five cool problems you can solve with Neo4J – Peter Neubauer

SWAG Alert — Riak at Velocity

June 6, 2010

Velocity, the “Web Performance and Operations Conference” put on by O’Reilly, kicks off tomorrow and we here at Basho are excited. Why? Because Benjamin Black, acknowledged distributed systems expert, will be giving a 90 minute tutorial on Riak. The official name of the session is called “Riak: From Design to Deploy.” If you haven’t already read it, you can get the full description of the session here.

I just got a sneak peek at what Benjamin has planned and all I can say is that if you are within 100 miles of Santa Clara, CA tomorrow and not in this session, you will regret it.

And, what better to go with a hands on Riak tutorial than some good old fashioned SWAG? Here is a little offer to anyone attending tomorrow: post a write up of Benjamin’s session and I’ll send you a Riak SWAG pack. It doesn’t have to be a novel, just a few thoughts will do. Post them somewhere online for all the world to see and learn from, and I’ll take care of the rest.

Enjoy Velocity. We are looking forward to your reviews!

Mark Phillips
Community Manager