Author Archives: Shanley

Erlang Factory London Recap

June 14, 2010

This was originally posted by @rklophaus on his blog, rklophaus.com.

Erlang Factory London gathers Erlang pioneers from across the world—Berlin to Boston, Krakow to Cordoba, and San Francisco to Shanghai—for a two-day conference of innovative Erlang development.

The summaries below are just a small sampling of the talks at Erlang Factory London. There were three tracks running back-to-back for two days, and I often couldn’t decide which of the three to attend. Slides and videos will be released by Erlang Solutions, and can be found under individual track pages on the Erlang Factory website.

Day 1 – June 10, 2010

Opening Session

Francesco Cessarini (Chief Strategy Officer, Erlang Solutions Ltd.), began the conference with a warm welcome and a quick review of progress made by Erlang-based companies in the last year.

Some highlights:

The History of the Erlang Virtual Machine – Joe Armstrong, Robert Virding

Joe Armstrong and Robert Virding gave a colorful, back-and-forth history of the Erlang’s birth and early years. A few notable milestones and achievements:

  • Joe’s early work on reduction machines. Robert’s complete rewrite of Joe’s work. Joe’s complete rewrite of Robert’s work. (etc.)
  • How Erlang was almost based on Smalltalk rather than Prolog
  • The quest to make Erlang 1.0x 80 times faster
  • Experiments with different memory management and garbage collection schemes
  • The train set used demonstrate Erlang, now in Robert’s basement
  • The addition of linked processes, distribution, OTP, and bit syntax

It’s easy to take a language like Erlang for granted and assume that its builders followed some well-known, pre-ordained path. Hearing Erlang’s history from two of its main creators provided an excellent reminder that building software is both an art and a science, uncertain and exciting like any creative process.

Riak from the Inside – Justin Sheehy

Justin Sheehy (CTO of Basho Technologies) opened his talk by introducing Riak, “a scalable, highly-available, networked, open-source key/value store.” He then very quickly announced that he wasn’t there to talk about using Riak, he was there to talk about how Riak was built using Erlang and OTP

There are eight distinct layers involved in reading/writing Riak data:

  • The Client Application using Riak
  • The client-side HTTP API or Protocol Buffers API that talks to the Riak cluster
  • The server-side Riak Client containing the combined backing code for both APIs
  • The Dynamo Model FSMs that interact with nodes using Dynamo style quorum behavior and conflict resolution
  • Riak Core provides the fundamental distribution of the system (not covered in the talk)
  • The VNode Master that runs on every physical node, and coordinates incoming interaction with individual VNodes
  • Individual VNodes (Virtual Nodes) which are treated as lightweight local abstractions over K/V storage
  • The swappable Storage Engine that persists data to disk

During his talk, Justin discussed each layer’s responsibilities and interactions with the layers above and below it.

Justin’s main point is that carefully managed complexity in the middle layers allows for simplicity at the edge layers. The top three layers present a simple key/value interface, and the bottom two layers implement a simple key/value store. The middle layers (FSMs, Riak Core, and VNode Master) work together to provide scalability, replication, etc. Erlang makes this possible, and was chosen because it provides a platform that evolves in useful and relatively-predictable ways (this is a good thing, a surprising evolution is bad).

Mnesia for the CAPper – Ulf Wiger

Ulf Wiger (CTO of Erlang Solutions) discussed where Mnesia might fit into the changing world of databases, given the new focus on “NoSQL” solutions. Ulf gave a quick introduction to ACID properties, Brewer’s CAP theorem, and the history of Mnesia, and then dove into a feature level description/comparison of Mnesia with other databases:

  • Deployed commercially for over 10 years
  • Comparable performance to current top performers clustered SQL space
  • Scalable to 50 nodes
  • Distributed transactions with loose time limits (in other words, appropriate for transactions across remote clusters)
  • Built-in support for sharding (fragments)
  • Incremental backup

The downsides are:

  • Erlang only interface
  • Tables limited to 2GB
  • Deadlock prevention scales poorly
  • Network partitions are not automatically handled, must recombine tables automatically

Ulf and others have done work to get around some of these limitations. Ulf showed code for an extension to Mnesia that automatically merges tables after they have split, using vector clocks.

Riak Search – Rusty Klophaus

I presented Riak Search, a distributed indexing and full-text search engine built on (and complementary to) Riak.

Part one covered the main reason for building Riak search: clients have built applications that eventually need to find data by value, not just by key. This is difficult, if not impossible, in a key/value store.

Part two described the shape of the final solution we set out to create. The goal of Riak Search is to support the Lucene interface, with Lucene syntax support and Solr endpoints, but with the operations story of Riak. This means that Riak Search will scale easily by adding new machines, and will continue to run after machine failure.

Part three was an introduction to Inverted Indexing, which is the heart of all search systems, as well as the difference between Document-Partitioning and Term-Partitioning, which forms the ongoing battle in the distributed search field. Part three continued with a deep-dive into parsing, planning, and executing the search query on Erlang.

Slides: http://www.slideshare.net/rklophaus/riak-search-erlang-factory-london-2010

Building a Scalable E-commerce Framework – Michael Nordström and Daniel Widgren

Michael Nordström and Daniel Widgren presented an Erlang-based e-commerce framework on behalf of their project team from Uppsala University (Christian Rennerskog, Shahzad Gul, Nicklas Nordenmark,
Manzoor Ahmad Mubashir, Mikael Nordström, Kim Honkaniemi, Tanvir Ahmad, Yujuan Zou, and Daniel Widgren) and their industrial partner, Klarna AB.

The application uses a “LERN stack” (Linux, Erlang, Riak, Nitrogen), to provide a reusable web shop that can be quickly set up by clients, customized via templates and themes, and extended via plugins to support different payment providers.

The project is currently going a rewrite to update to the latest versions of Riak and Nitrogen.

GitHub: http://github.com/mino4071/CookieCart-2.0

Twitter: @Cookie_Cart

Clash of the Titans: Erlang Clusters and Google App Engine – Panos Papadopoulos, Jon Vlachoyiannis, Nikos Kakavoulis

Panos, Jon, and Nikos took turns describing the technical evolution of their startup, SocialCaddy, and why they were forced to move away from the Google App Engine. SocialCaddy is a tool that mines your online profiles for important events and changes, and tells you about them. For example, if a friend gets engaged, SocialCaddy will tell you about it, and assist you in sending a congratulatory note.

Google App Engine imposes a 30 second limit on requests. As SocialCaddy processed larger and larger social graphs, they bumped into this limit, which made GAE unusable as a platform. In response, the team developed Erlust, which allows you to submit jobs (written in any language) to a cluster. An Erlang application coordinates the jobs, and each job should read from a queue, process messages, and write to another queue.

Using Open-Source Trifork QuickCheck to test Erjang – Kresten Krab Thorup

Kresten Krab Thorup (CTO of Trifork) stirred up dust when he originally announced his intention to build a version of Erlang that ran on the JVM. Since then, he has made astounding progress. Erjang turns Erlang .beam files into Java .class files, now supporting a broad enough feature set to run Mnesia over distributed Erlang. Kresten claimed performance matching (or at times exceeding) that of the Erlang VM.

Erjang is still a work in progress, there are many BIFs that still need to be ported, but if a prototype exists to prove viability, then this prototype was certainly a success. One slide showed the code for the spawn_link function reimplemented in Java in ~15 lines of simple Java code.

For the second half of his talk, Kresten showed off Triq (short for Trifork Quickcheck), a scaled-down, open-source QuickCheck inspired testing framework that he built in order to test Erjang. Triq supports basic generators (called domains), value picking, and shrinking. Kresten showed examples of using Triq to validate that Erjang performs binary operations with the exact same results as Erlang.

More information about Erjang here: http://wiki.github.com/krestenkrab/erjang/

Day 2 – June 11, 2010

Efene: A Programming Language for the Erlang VM – Mariano Guerra

Mariano Guerra presented Efene, a new language that is translated into Erlang source code. Efene is intended to help coax developers into the world of Erlang who might otherwise be intimidated by the Prolog-inspired syntax of Erlang. We’ve heard about a number of other projects compiling into Erlang byte-code (such as Reia and Lisp-Flavored Erlang), but Efene takes a different approach in that the language is parsed and translated using Leex and Yecc into standard Erlang code, which is then compiled as normal. By doing this, Mariano manages to leave most of the heavy lifting of optimizations to the existing Erlang compiler.

Efene actually supports two different syntax flavors, one with curly brackets, the other without, leading to a syntax that feels vaguely like Javascript or Python, respectively. (The syntax without curly brackets is called Ifene, for “Indented Efene”, and is otherwise identical to Efene.)

In some places, Efene syntax is a bit more verbose than Erlang. This is done to make the language more readable than Erlang. (“if” and “case” statements have more structure in Efene than Erlang.) In other places, Efene requires less typing, multi-claused function definitions don’t require you to repeat the function name, for example.

Code samples and more information: http://marianoguerra.com.ar/efene

Erlang in Embedded Systems – Gustav Simonsson, Henrik Nordh, Fredrik Andersson, Fabian Bergstrom, Niclas Axelsson and Christofer Ferm

Gustav, Henrik, Fredrik, Fabian, Niclas, and Christofer (Uppsala University), in cooperation with Erlang Solutions, worked on a project to shrink the Erlang VM (plus the Linux platform on which it runs) down to the smallest possible footprint for use on Gumstix and BeagleBoard hardware.

The team experimented with OpenEmbedded and Angstrom, using BusyBox, uClibc, and stripped .beam files to further decrease the footprint. During the presentation, they played a video showing how to install Erlang on a Gumstix single-board computer in 5 minutes using their work.

More information about Embedded Erlang here: http://embedded-erlang.org

Zotonic: Easy Content Management with Erlang’s Performance and Flexibility – Marc Worrell

Marc Worrell (WhatWebWhat) breaks CMSs into:

  • 1st Generation – Static text and images
  • 2nd Generation – Database- and template-driven systems (covers current CMS systems)
  • 3rd Generation – Highly interactive, real-time, personalized data exchanges and frameworks

Zotonic is aimed squarely at the third generation, Zotonic turns a CMS into a living, breathing thing, where modules on a page talk to each other and other sessions via comet, and the system can be easily extended, blurring the line between CMS and application framework.

This interactivity is what motivated Marc to write the system in Erlang; at one point he compared the data flowing through the system to a telephone exchange. Zotonic uses Webmachine, Mochiweb, ErlyDTL, and a number of other Erlang libraries, with data in PostgreSQL. (Marc also mentioned Nitrogen as an early inspiration for Zotonic, parts of Zotonic are based on Nitrogen code, though much has been rewritten.)

The data model is physically simple, with emergent functionality. A site is developed in terms of objects (called pages) interlinked with other objects. In other words, from a data perspective, adding an image to a web page is the same as linking from a page to a subpage, or tagging a page with an author. Mark gave a live demo of Zotonic’s ability to easily add and change menu structures, modify content, and add and re-order images. Almost everything can be customized using ErlyDTL templates. Very polished stuff.

Marc then introduced his goal of “Elastic Zotonic”, a Zotonic that can scale in a distributed, fault-tolerant, “buzzword-compliant” way, which will involve changes to the datastore and some of the architecture.

Marc is now working with Maximonster to develop an education-oriented social network on top of Zotonic.

More information: http://zotonic.com

Closing Session

Francesco (CSO, Erlang Solutions, Ltd.) thanked the sponsors, presenters, and audience. Frank then gave a big special thanks to Frank Knight and Joanna Włodarczyk, who both worked tirelessly to organize the conference and make everything go smoothly.

Final Thoughts

Erlang is gaining momentum in the industry as a platform that enables you to solve distributed, massively concurrent problems. People aren’t flocking directly to Erlang itself, they are instead flocking to projects built in Erlang, such as RabbitMQ, ejabberd, CouchDB, and of course, Riak. At the same time, other languages are adopting some of the key features that make Erlang special, including a message-passing architecture and lightweight threads.

Berlin Buzzwords Day Two Recap

June 9, 2010

This was originally posted by @rklophaus on his blog, rklophaus.com.

Day 2 – June 8, 2010

Keynote – Pieter Hintjens

Pieter (iMatix) started his talk with a series of high-level questions, developer-to-developer, intended to focus the audience on the fact that multi-core processing across multiple computers is the new norm, and (most) programming tools haven’t yet evolved to meet the challenge.

He then identified and discussed some of the natural patterns in software development that make things simpler. After a few examples relating to the NoSQL world, he identified three that led into his introduction of 0MQ (Pronounced Zero-M-Q):

  • Asynchronous processing is a natural pattern, single threads that read from a queue, do work, and write to another queue.
  • Choosing reliability over persistence is a natural pattern. If you don’t crash, you don’t need to worry about persisting data. (Not sure how I feel about this one. Just gonna roll with it for the sake of the talk.)
  • Being agnostic to lines of communication is a natural pattern, what you send is orthogonal to how you send it.

Pieter then introduced the 0MQ library, which attempts to be a simple and lightweight message queue following these natural patterns. It takes care of defining queue endpoints, connecting (and re-connecting) the endpoints, buffering messages in memory, and not much else. The data format is simply a length and a binary blob, that’s all.

According to Pieter, 0MQ should be thought of as a protocol, just like TCP or UDP. In other words, 0MQ is the sort of thing you embed in your database application.

With 0MQ, you can safely create multi-threaded applications that safely leverage multiple cores by making each worker process single threaded, and have it read from a queue, perform some unit of work, and write to another queue. (In other words, the Actor model, concurrency by message passing.) The idea is not new, but it bears repeating as often as possible because it’s far simpler than multithreaded systems with locking, 99% of the time it’s the right solution, and many people still don’t know it.

Pieter had two choice quotes that drove home the main goals of 0MQ:

  • “If a developer can’t pick it up in a weekend, it’s not going to work.”
  • “Cheap effective messaging changes the way we think about architecting applications.”

Hypertable – Doug Judd

I unfortunately missed the first few minutes of Doug’s talk. When I arrived, Doug (Hypertable) was in the midst of an architectural overview of the Google stack, BigTable architecture, the ideas behind a log-structured merge tree, and examples of Hypertable optimizations, including bloom filters and using different compression algorithms in different parts of the system.

The money slides came toward the end, with performance comparisions claiming Hypertable to be 70% faster than HBase on random reads and sequential writes. Another chart claimed Hypertable to be multiple times faster than HBase when doing only random reads of different distributions.

Apart from the previously mentioned optimizations, there seem to be two main reasons for Hypertable’s speed: It’s based in C vs. HBase’s Java, and it is smart enough to dynamically adjust memory between caching reads and buffering writes according to the read/write distribution of the data.

And yes, Hypertable works with Hadoop for Map/Reduce-ing goodness…

Apache Cassandra Revisited – Eric Evans

Eric quickly focusing the talk by narrowing from All-of-NoSQL to Just-the-Large-Data-Projects, and then finally Just-the-BigTable-or-Dynamo-projects, which means Cassandra, HBase, Hypertable, Riak, and Voldemort.

Given these criteria, Eric called Cassandra the love-child of BigTable AND Dynamo, having influences from both. As such it has Dynamo staples like homogonous nodes, P2P-routing and partitioning (though not VNodes), and things like SSTables and (optionally) ordered data and range queries, similar to BigTable. (His slides contained a humorous, yet distrurbing picture showing a Brad Pitt/Angelina Jolie mutant child.)

Eric described the bootstrap process, the Cassandra data model (Keyspace, Column Family, Record, Column), and the interface (Thrift), and showed API examples.

He then highlighted a few key Cassandra developments and features:

  • Cassandra does now support batch Map/Reduce via Hadoop.
  • Cassandra comes with rack awareness, and this can be customized.
  • Keyspaces and Column families, currently defined in XML, will soon be configurable via an API without a restart.
  • Vector clocks will be added in the future. (But in his view, the hype outweights the benefit.)
  • SuperColumns may be phased out in the future, as they are not widely used, and lead to more confusion than they are worth.

According to Eric, the largest Cassandra instance that he knows of is Twitter, with around 100 nodes holding about 170TB of data.

Massively Parallel Analytics Beyond Map/Reduce – Fabian Huske

Fabian (TU Berlin) began by describing some of the challenges behind Map/Reduce, namely that it does make big data processing more simple than it used to be, but it still requires a developer to fit his problem into something Map/Reduce shaped, and this is exacerbated by the complexities of the various Map/Reduce frameworks out there.

Fabian then introduced Stratosphere, which is a combination programming model (PACT) and execution engine (Nephele) that provides additional blocks beyond Map and Reduce that can be used instead of a simple Map or Reduce, with the dual goals of making it easier to program as well as require fewer execution phases leading to higher performance. Stratosphere is a result of combining Map/Reduce with parallel database technology.

As an example, Fabian showed a SQL task that could be converted to two Map/Reduce jobs that with Stratosphere could be made simpler using Stratosphere.

A few examples: with PACT, you have new second-order functions in which to put your user code such as operations for “cross” (compute a cartesian cross-product of inputs), “match” (compute only where input keys from both sources match), and “cogroup” (missed this one). Building more complex second-order functions allows for less user code.

Next steps for the projcet are more input contracts, flexible checkpointing and recovery, and robust and adaptive execution, with a goal of going open-source by the end of 2010.

Sqoop – Database Import and Export for Hadoop – Aaron Kimball

Aaron (Cloudera) set the stage with a quick run-down of the limits of the SQL world, and the plusses and minuses of Hadoop, which lead to the introduction of Sqoop (SQL in Hadoop).

Sqoop provides a suite of tools to connect Hadoop to a JDBC-compliant SQL database, extract data and schema information, import the data into Hadoop, auto-generate code to parse the data, and export any results back into the SQL database.

The goal is to make it easier to pull SQL-hosted data into your Hadoop-cluster for the purpose of having the data available while doing other processing. For example, clickstream data might be in Hadoop, while profile information is in SQL. With Sqoop, you can get the data into Hadoop in an efficient way to support analysis. Copying the data from SQL in one operation is better than repeatedly hitting the database while running analysis because a big Hadoop cluster can easily hose a SQL machine.

Sqoop has some complexity under the hood:

  • It can export data definitions to Hive (see writeup next.)
  • It reads from/write to SQL in parellel.
  • You can use a SELECT/WHERE query to get data, which allows you to run Sqoop incrementally, fetching new data since the last run.
  • Supports mysqldump and mysqlexport.

 

Hive: SQL for Hadoop – Sarah Sproehnle

Sarah (Cloudera) described Hive, a parser, optimizer, compiler, and shell for transforming SQL-like queries into Map/Reduce. With Hive, you think of your data as being in tables rather than files, so you create tables, load data from a local file or Hadoop file into the table, and can then run SQL-like queries.

(I used the word “SQL-like” above, but Hive queries are actually standards compliant SQL, with just a few limitations/twists. Anyone who knows SQL at any level can pick up the changes in just a few minutes.)

In other words, with Hive you can:

  • CREATE and DESCRIBE tables, and ADD and DROP columns on tables.
  • EXPLAIN queries.
  • Query Hadoop data using SELECT, TOP, FROM, JOIN, WHERE, GROUP BY, and ORDER BY.
  • Write Hadoop data using INSERT (though the insert actually means “clobber the old data and replace it with this new data”)
  • In a twist on standard SQL, run a multi-table INSERT, where you split a single SELECT/FROM into multiple output streams, allowing you to write different columns to different tables. You can also, further filter, group, or transform the data in each stream independently before writing to the final table.
  • Run data through a custom shell script, expecting lines of data on stdin, results on stdout.
  • Partition and bucket data, allowing for easy ways to drop a subset of data, or take a sampling of data.

Hive gives you the convenience of SQL, but at the end of the day it’s still running as a Map/Reduce job on Hadoop, which means:

  • No transactions.
  • Latencies measured in minutes, not milliseconds.
  • No indexes, think of everything as a full-table scan.

Not surprising, and not bad considering you can run a SQL query across Petabytes of data.

The Hive install is installed on the client, so you don’t need to do anything to the Hadoop cluster to run it. Hive keeps schema information in a Metastore, which can be kept on the local machine without any special configuration, or shared in a central repository allowing multiple users to share Hive table definitions. The schema is verified at data read time, not when the schema is created. Again, this makes sense given Hadoop’s execution model.

1,000 points to Sarah for running a live demo during the presentation. Gutsy, but always a crowd pleaser.

Talks I Wished I Had Attended

The conference schedule today had two tracks, so there were a number of talks I was not able to attend. I would have liked to see the talks below, and look forward to the conference video:

  • Hadoop: An Industry Perspective – Aaron Kimball
  • Behemith: A Hadoop-based platform for large scale document processing – Julian Nioche
  • Introduction to Collaborative Filtering using Mahout – Frank Scholten

Closing Session

Isabel Drost, Jan Lehnardt, and Simon Willnauer kept the wrap-up short, thanking the other organizers, the tech staff (who gave a quick, fun recap of network usage), the venue, the presenters, and the audience.

When Jan asked who wanted to go to BerlinBuzzwords 2011 next year, every hand in the room shot up.

Final Thoughts

BerlinBuzzwords was an amazing conference. Half of the credit goes to the organizers for picking a great venue and interesting presenters. The other half goes to the largely German/European audience, who, 99% of the time, were focused on the presentation with laptops closed and (often) paper notepads open. This level of engagement lead to great questions from the audience after each presentation, and lots of hallway interaction. Sign me up for next year!

Berlin Buzzwords Day One Recap

June 8, 2010

This was originally posted by @rklophaus on his blog, rklophaus.comBerlinBuzzwords has a stellar venue and talks describing cutting edge developments on all things search, scalability, and storage. My recap of Day 1 is below.

Check back for a writeup on part 2.

Day 1 – June 7, 2010

The conference check-in was seamless, with much swag including messenger bags and notebooks. The venue itself–Kosmos Club–was amazing. Kosmos Club was the biggest movie house in East Germany before the fall of the wall, and has since turned into an event venue. Lots of metallics, varied textures, and swank chandeliers, with two of the biggest presentation rooms I’ve ever seen (they used to be movie theatres.)

Isabel Drost, Jan Lehnardt, and Simon Willnauer kept the opening remarks light and intimate. Overall, there were about 350 people in attendance.

Keynote: Grant Ingersoll from Lucid Imagination

Grant focused his talk around the words “Open”, “Scalable”, and “Intelligent”. He described how a number of things, such as big-data storage, search, and distributed computing have become commodities, but required a staff of Ph.D. level employees just a few years ago.

The main point of his talk is that openness and open-source, plus scalability, have turned these things into commodities, and that the next big, interesting thing to work on is data intelligence.

  • Data (produced per year, I believe) has grown from 161 exabytes in 2006 to ~1000 exabytes in 2010.
  • 85% of data we produce is unstructured, where unstructured may mean that we just aren’t yet smart enough to parse the data.

There are multiple levels to intelligence:

  • Level 1 – Finding, organizing, discovering and associating data
  • Level 2 – Collecting and personalizing data
  • Level 3 – Mining data for sentiment and semantics
  • Level 4 – Learning from data, extracting ideas
  • Level 5 – Reasoning about data

At this point in the talk, he switched to a description of Apache Mahout, a machine-learning engine. Mahout can do things such as recommendations, collaborative filtering, Bayesian analysis, Random Forests, discovery, and pattern matching. At some point, he mentioned things like Restricted Boltzmann Machines, Stachastic Gradient Dsecents, and Vector Machines as upcoming features.

The takeaway is that using Mahout, you can build open, scalable, intelligent apps right now. In practical terms, this means things like auto-suggest, auto-complete, content clustering, clickstream analysis, etc.

NoSQL: The Definitive Guide – Mathias Mayer

Mathias Meyer (Peritor Solutions, @roidrage) gave a very balanced view of the current state of NoSQL. Mathias gave history its proper respect, saying that much of what we view as “new” in NoSQL is actually older ideas in a prettier package. He mentioned things like Berkely DB (K/V store), Sybase (Column Store), Versant Object DB (Graph/Object Database), and Lotus Notes (Peer to Peer Document Database) as throwback examples.

Mathias said that relational databases tried to be a one-size-fits all solution, and that NoSQL is about removing constraints to speed up performance and development.

Mathias himself is a fan of CouchDB, Redis, and Riak, and wisely avoided giving specific recommendations on what projects someone should use. Each datastore handles different use cases, so use what is right for your data.

Mathias briefly touched on Voldemort, Tokyo, Redis, S3, Scalaris, Couhdb, Riak, Mongo, BigTable, Cassandra, HBase, HyperTable, Core Data, Neo4J, and HyperGraphDB. (In order of mention.) He then gave a slightly more in depth view of the replication and scaling models of both CouchDB and Dynamo.

One of the key things that NoSQL gets right, he said, is being constructed of open web technologies such as JSON, HTTP, links, and textual protocals.

What is hard for NoSQL now? Range queries, ad-hoc queries, and transactions, mainly because the NoSQL space is focused on scalability as a major goal.

One of the big points that I’m glad he brought up is, “As a developer, how do I know I’m not wasting my time on a NoSQL solution?” The key is that each of the different NoSQL projects was built to solve a real-world problem, so trust that somebody found it useful and needed it built.

His main point: NoSQL is not the Holy Grail. NoSQL should not be about replacing SQL. Instead, you need to be okay with having polyglot data storage. The data itself should dictate the datastore.

Making Software for Humans – CouchDB and the Usable Peer-To-Peer Web – Jan Lehnhardt

Jan Lehnardt (@janl, Couch.io), after a brief introduction of himself and Couch, led off with the statement that 80% of all NoSQL projects do the same thing as flat files. It’s the differences in the last 20% that really differentiate the projects. Therefore, NoSQL is about choice to build better systems. Each NoSQL project starts with a main idea.

According to Jan, CouchDB’s main idea is being ready to scale up, in that each node functions independently, and also ready to scale down, in that Couch is a great candidate for running on embedded mobile phones, routers, and other devices, as a way of synchronizing user content.

CouchDB’s synchronization is it’s killer feature. With Couch, you can subscribe to events like data updates that send out HTTP based notifications to other parts of your application. As an example, the Couch.io team has built a chat service based solely on writing Couch objects and receiving updates.

Jan wrapped up by touching on projects like Opera Unite as ahead-of-the-parade examples of what CouchDB is aiming to do, and is already doing, in projects like UbuntuOne–P2P data synchronization. He mentioned Facebook as an example of a centralized, closed web, Flickr as a centralized, but open service (since you can pull your data out), and Diaspora as the open web that follows the true vision of Tim Berners-Lee.

Riak from Small to Large – Rusty Klophaus

In my talk (Rusty Klophaus, Basho Technologies), I gave a brief description of how Riak differentiates itself in the NoSQL market. It was built first with the operations folks in mind, which makes sense given the Akamai background of the core developers, who understand and embrace the uncertainties of a distributed system.

I then described which features of Riak become important in single-node Riak clusters, three-node clusters, and ten-plus node clusters. Just like different features of your car are important going 50 m.p.h. vs. 100 m.p.h., different features of Riak are important at different cluster sizes.

In single node clusters, Riak provides a simple data model, with key/value access, a variety of client libraries (Python, PHP, Java, Javascript, Erlang), and configurable replication settings and backing datastores.

In small-to-medium sized clusters, Riak provides a way to take advantage of hardware in parallel, with Javascript-based Map/Reduce, well-behaved HTTP (allowing easy placement of proxies and caches), and Google Protobuffs support.

And in large-clusters, Riak provides an extremely easy operations story that can survive server outages and network partitions, and scale out by just running a few commands.

Finally, I ended with a 5-minute run through of how to use the Python client API to read/write an object, run a linkwalking operation, and run a Javascript-based map/reduce operation.

Slides are available on Slideshare.

Realtime Search with Lucene – Michel Busch

Michael Busch from Twitter discussed some upcoming changes to Lucene that allow it to search on data that has not yet been committed to disk. From my understanding, when Lucene commits a change it creates a segment and possibly merges multiple segments together. At that point, a reader can access the newly created segment.

For real-time search, the process needed to be shortened. The first attempt simply involved syncing out the changes when a reader was created. This didn’t work well. The next step was to actually search on the uncommitted index in memory. This was a challenge for a few reasons: first, Lucene uses multiple threads to update the index, so synchronizing those threads to provide the correct read-isolation is a problem. Second, the index maintains a large number of long lived objects in memory, and this causes inefficient garbage collection that kills performance.

Michael described a number of fixes that have already been written or are on their way, mostly around making multiple single-threaded index writers and changing the way postings are stored in memory which changes their structure from an unbounded number of objects to instead use a finite number of arrays. Effects on performance were amazing, especially for small memory sizes. A JVM with ~200 MB of RAM allocated was something like ~80% more performant.

A Twitter prototype with simultaneous indexes and queries showed that query performance, with the new modifications, is almost completely independent from query load, which is impressive. That said, the Twitter index has 32-bit postings (24 for DocID, 8 for term postition) so these results may not be the same for everyone.

ElasticSearch – Shay Banon

I unfortunately missed the first few minutes of Shay’s talk on ElasticSearch. ElasticSearch automates the partitioning, sharding, and replication of documents into Lucene indexes, and provides a unified interface for searches.

Shay described the JSON model that ElasticSearch uses for API access, which includes everything from queries and filters to creating new indexes.

The ElasticSearch distribution model works by posting a JSON document with the new index definition. ElasticSearch automatically balances the shards across available nodes, and it sounded like this is done in a node-aware way, so that replicas are stored on different nodes if possible. At index or query time, you can hit any node, and the node itself is responsible for directing the operation to the right place(s).

ElasticSearch supports per-document consistency (in other words, no commit/flush support.) The most interesting thing, to me, is that ElasticSearch embraces the idea of transient storage. In other words it was written with the intent of running on something like EC2, where you can’t trust local storage, and writing to remote storage is expensive, both computationally and monetarily.

To get around this potential bottleneck, ElasticSearch still supports reliable (or “somewhat reliable”, if you want to split hairs) persistence by assuming that not all replicas containing a node will fail at the same time. New documents are lazily logged to the backing store, and if the master node doing the logging dies, then one of its slaves will detect the death and finish the logging. When a new node starts up, it reads from the backing store.

The other interesting thing is that ElasticSearch is aware of different cloud providers (it sounded like EC2 and Rackspace Cloud) and consult the cloud provider itself to get a list of potential nodes, allowing it to self-assemble.

Key differences between ElasticSearch and Solr? Distributed model, different API, no facet support yet.

Nutch as a Web Mining Platform – Andrzej Bialecki

Andrzej Bialecki (SIGRAM) described Nutch, a distributed web-crawler/search engine built on top of Lucene. It provides the standard framework of a distributed search engine that is extensible by plug-ins, and handles things such as URL filtering, normalizing, depth vs. breadth first crawling, etc.

The presentation was eye-opening just to see all of the things that make web-crawling difficult that are NOT about storing data and serving up queries. The Nutch team has spent a large amount of time going into things like what happens when you encounter auto-generated sites, buggy web-servers, link-spammers, or other tar-pits during a crawl.

Some common techniques for “bootstrapping” a web crawler are to start with some high quality seed sites, which may be well-known, authoritative resources, reference sites, or even the top-N results from an existing search engine like Google.

Once you have your search data, Andrzej described ways to mine the data, such as using keyword, phrase, or anchor search, using facets to find latent topics, using top-N results to prioritize future crawling, mining incoming links, treating the web as a corpus of sample textual data, associating concepts, uncoving gossip, opinions, and sentiments from data.

Nutch is currently under a redesign, attempting to share more code with common crawler libraries and other projects. Part of this will be converting data storage to a well-defined layer, allowing for pluggable backends so that users can take advantage of native data tools for those backends.

Riak Search – Rusty Klophaus

In my second talk, I discussed Riak Search, a distributed indexing and full-text search engine built on (and complementary to) Riak.

Part one covered the main reason for building Riak search: clients have built applications that eventually need to find data by value, not just by key. This is difficult, if not impossible, in a key/value store.

Part two described the shape of the final solution we set out to create. The goal of Riak Search is to support the Lucene interface, with Lucene syntax support and Solr endpoints, but with the operations story of Riak. This means that Riak Search will scale easily by adding new machines, and will continue to run after machine failure.

Part three was an introduction to Inverted Indexing, which is the heart of all search systems, as well as the difference between Document-Partitioning and Term-Partitioning, which forms the ongoing battle in the distributed search field.

The tradeoffs are that Document-Partitioning generally has lower latency queries, but lower overall throughput due to it requiring a disk-seek on each partition. For this reason, Riak Search uses Term-Based partitioning, with some special optimizations using term-splitting, bloom filters, and result batching.

Slides available soon.

 

Talks I Wished I Had Attended

The conference schedule today had two tracks, Search and NoSQL, plus I presented two talks, so there were a number of talks I was not able to attend. I would have liked to see the talks below, and look forward to the conference video:

  • Finite-State Queries in Lucene – Robert Muir
  • Text and metadata extraction with Apache Tika – Jukka Zitting
  • MetaCarta GeoSearch Toolkit for Solr – James Goodwin
  • The return of the Hierarchical Model – Jukka Zitting
  • Five cool problems you can solve with Neo4J – Peter Neubauer

SWAG Alert — Riak at Velocity

June 6, 2010

Velocity, the “Web Performance and Operations Conference” put on by O’Reilly, kicks off tomorrow and we here at Basho are excited. Why? Because Benjamin Black, acknowledged distributed systems expert, will be giving a 90 minute tutorial on Riak. The official name of the session is called “Riak: From Design to Deploy.” If you haven’t already read it, you can get the full description of the session here.

I just got a sneak peek at what Benjamin has planned and all I can say is that if you are within 100 miles of Santa Clara, CA tomorrow and not in this session, you will regret it.

And, what better to go with a hands on Riak tutorial than some good old fashioned SWAG? Here is a little offer to anyone attending tomorrow: post a write up of Benjamin’s session and I’ll send you a Riak SWAG pack. It doesn’t have to be a novel, just a few thoughts will do. Post them somewhere online for all the world to see and learn from, and I’ll take care of the rest.

Enjoy Velocity. We are looking forward to your reviews!

Mark Phillips
Community Manager

Riak Fast Track Revisited

May 27, 2010

You may remember a few weeks back we posted a blog about a new feature on the Riak Wiki called The Riak Fast Track. To refresh your memory, “The Fast Track is a 30-45 minute interactive tutorial that introduces the basics of using and understanding Riak, from building a three node cluster up through MapReduce.”

This post is intended to offer some insight into what we learned from the launch and what we are aiming to do moving forward to build out the Fast Track and other similar resources.

The Numbers

The Fast Track and accompanying blog post were published on Tuesday, May 5th. After that there was a full week to send in thoughts, comments, and reviews. In that time period:

  • I received 24 responses (my hope was for >15)
  • Of those 24, 10 had never touched Riak before
  • Of those 24, 13 said they were already planning on using Riak in production or after going through the
    Fast Track were now intending to use Riak in production in some capacity

The Reviews

Most of the reviews seemed to follow a loose template: “Hey. Thanks for this! It’s a great tool and I learned a lot. That said, here is where I think you can improve…”

Putting aside the small flaws (grammar, spelling, content flow, etc.), there emerged numerous recurring topics:

  • Siblings, Vector Clocks, Conflict Resolution, Concurrent Updates…More details please. How do they work in Riak and what implications do they have?
  • Source can be a pain. Can we get a tutorial that uses the binaries?
  • Curl is great, but can we get an Erlang/Protocol Buffers/language specific tutorial?
  • I’ve heard about Links in Riak but there is nothing in the Fast Track about it. What gives!?
  • Pictures, Graphics and Diagrams would be awesome. There is all this talk of Rings, Clusters, Nodes, Vnodes, Partitions, Vector Clocks, Consistent Hashing, etc. Some basic diagrams would go along way in helping me grasp the Riak architecture.
  • Short, concise screencasts are awesome. More, please!
  • The Basic API page is great but it seems a bit…crowded. I know they are all necessary but do we really need all this info about query parameters, headers and the like in the tutorial?

Another observation about the nature of the reviews: they were very long and very detailed. It would appear that a lot of you spent considerable time crafting thoughtful responses and, while I was expecting this to some extent, I was still impressed and pleasantly surprised.

This led me to draw two conclusions:

  1. People were excited by the idea of bettering the Fast Track for future Riak users to come
  2. Swag is a powerful motivator

Now, I’m going to be a naïve Community Manager and let myself believe that the Riak Developer Community maintains a high level of programmer altruism. The swag was just an afterthought, right?

So What Did We Change?

We have been doing the majority of the editing and enhancing on the fly. This process is still ongoing and I don’t doubt that some of you will notice elements still present that you thought needed changing. We’ll get there. I promise.

Here is a partial list of what was revised:

  • The majority of changes were small and incremental, fixing a phrase here, tweaking a sentence there. Many small fixes and tweaks go a long way!
  • The most-noticeable alterations are on the MapReduce page, where we worked a lot to make it flow better and more interactive. This continues to be improved.
  • The Basic API Operations page got some love in the form of simplification. After reading your comments, we went back and realized that we were probably throwing too much information at you too fast.
  • There are now several graphics relating to the Riak Ring and Consistent Hashing. There will be more.

And, as I said, this is still ongoing.

Thank You!

I’ve added a Thank You page to the end of the Fast Track to serve as a permanent shout-out to those who help revise and refine the Fast Track. (I hope to see this list grow, too.) Future newcomers to Riak will surely benefit from your time, effort, and input.

What is Next?

Since its release, the Fast Track tutorial has become the second most-visited page on the Riak Wiki, second only to the wiki.basho.com itself. This tells us here at Basho that there is a need for more tools and tutorials like this. So our intention is to expand this as far as time permits.

In the short term, we plan to add a link-walking page. This was scheduled for the original iteration of the Fast Track but was scrapped because we didn’t have time to assemble all the components. The MapReduce section is going to get more interactive, too.

Another addition will be content and graphics that demonstrate Riak’s fault-tolerance and ability to withstand node outages.

We also want to get more specific with languages. Right now, it uses curl over HTTP. This is great but language-specific makes tremendous sense, and the only preventing us from doing this is time. The ultimate vision is to expand transform the Fast Track into a sort of “choose your own adventure” module, such that if a Ruby dev who prefers Debian shows up at wiki.basho.com without having ever heard of Riak, they can click a few links and arrive at a tutorial that shows them how to spin up three nodes of Riak on Debian and query it through Ripple. Erlang, Ruby, Javascript and Java are at the top of the list.

But, we have a long way to go before we get there, so stay tuned for continuous enhancements and improvements. And if you’re at all at interested in helping develop and expand the Fast Track (say, perhaps, outlining an up-and-running tutorial for for Riak+JavaScript) don’t hesitate to shoot an email to mark@basho.com.

Mark

Community Manager

Free Webinar – Load-Testing Riak – June 3rd @ 2PM Eastern

May 26, 2010

We frequently receive questions about how Riak will behave in specific production scenarios, and how to tune it for those workloads. The answer is, generally, that it depends — the only way to know for sure is to measure!

We invite you to join us for a free webinar on Thursday, June 3 at 2:00PM Eastern Time to discuss how to measure Riak performance and develop your own action testing plan. We’ll discuss:

  • Understanding what to test, and how to test it
  • Classifying your application and its load profile
  • Configuring and using our load generation and measurement tool
  • Interpreting the results of your test and taking next steps

As part of the webinar, you will get access to and a demonstration of basho_bench, our benchmarking and load-testing tool. The presentation will last 30 to 45 minutes.

Registration will be limited to 20 parties, on a first-come, first-serve basis. Fill in the form below if you’re serious about getting a scalable, predictable, cost-effective storage solution in Riak!

UPDATE: Due to overwhelming demand for this Webinar, we are expanding the number of seats available. So don’t shy away from registering if you think it’s full. We will have plenty of room for you!

The Basho Team

Webinar Registration Form

Registration has closed. We’ll hold another one soon!

Riak Search

May 21, 2010

This post is going to start by explaining how in-the-trenches experience with key/value stores, like Riak, led to the creation of Riak Search. Then it will tell you why you care, what you’ll get out of Riak Search, and why it’s worth waiting for.

A bit of history

Few people know that Basho used to develop applications for deployment on Salesforce.com. We had big goals, and were thinking big to fill them, and part of that was choosing a data storage system that would give us what we needed not only to succeed and grow, but to survive – a confluence of pragmatism and ideal that embodied a bulletproof operations story, a path upward — resilience, reliability, and scalability, through the use of proven science.

So, that’s what we did: we developed and used what has grown to be, and what you know today, as Riak.

Idealism can’t get you everywhere, though. While we answered hard questions with link-walking and map/reduce, there was still the desire in the back of all of our heads: sometimes you just want to ask, “What emails were sent on May 21 that included the word ‘strategy’?” without having to figure out how to walk links from an organizational chart to mailboxes to mails, and then filter over the data there. It was a pragmatic desire: we just wanted a quick answer in order to decide whether or not to spend more time chasing a path. “Less yak-shaving, please.”

The Operations Story

Then we stopped making Salesforce.com apps, and started selling Riak. We quickly found the same set of desires. Operationally, Riak is a huge win. Pragmatically, something that does indexing and search in a similar operational manner is even bigger. Thus, Riak Search was born.

The operational story is, in a nutshell, this: when you add another node to your cluster, you add capacity and compute power. That’s it, you just add another box and “it just works.” Purposefully or not, eventually a node leaves the cluster, hardware fails, whatever: Riak deals with it. If the node comes back, it’s absorbed like it never left.

We insisted on these qualities for Riak, and have continued that insistence in Riak Search. We did it with all the familiar bits: consistent hashing, hinted handoff, replication, etc.

Why Riak Search?

Now, we’ll be the first to tell you that with Riak you can get pretty far using link-walking and map/reduce, with the understanding that you know what you are going to want ahead of time, and/or are willing to wait for it.

Riak Search answers questions that pop into your head; “find me all the blue dresses that are between $20 and $30 dollars,” “find me the document Bob referred to last week at the TPS procedures meeting,” “how can I delete all these emails from my aunt that have those stupid attachments?” “find me that comic strip with Bob,” etc.

It’s about making members of the sea of data in your key-value store findable. At a higher level, it’s about agility. The ability to answer questions you have about your business and your customers without having to consult a developer or dig through reference manuals and without your application developers having to reinvent the wheel with a very real possibility of doing it just right enough to assure you nothing will go wrong. It’s about a common indexing language.

Okay, now you know — thanks for bearing with us — let’s get to the technical bits.

Riak Search …

The system we have built …

  1. is an easily-scalable, fault-tolerant search and indexing system, adhering to the operational story you just read
  2. supports full-text indexing and search
  3. allows querying via the Lucene query syntax
  4. has Solr-compatible /select and /update web-services
  5. supports date and numeric indexing
  6. supports faceting
  7. automatically distributes indexes
  8. has an intermediate query language and integrated query planner
  9. supports scoring
  10. has integrated tokenizing, filtering and analysis (yes, you can
    use StandardAnalyzer!)

… and much more. Sounds pretty great, right?

If you want to know more about the internals and technical nitty gritty, check out the Riak Search presentation one of our own, Riak Search engineer John Muellerleile, gave at the San Francisco Erlang Factory this year.

So, why don’t you have it yet? The easy part.

There are still some stubs and hard-coded things in what we have. For instance, the full-text analyzer in use is just whitespace, case-normalization, and stop-word filtering. We intend to fully support the ability to specify other Lucene analyzers, including custom modules, but the code isn’t there yet.

There is also very little documentation. Without a little bit of handholding, even the brightest and most ambitious user could be forgiven for staring blankly, lost for even the first question to ask. We’re spreading the knowledge among our own team right now; that process will generate the artifacts needed for the next round of users to step in.

There are also many fiddly, finicky bits. These are largely relics of early iterations. Rather than having the interwebs be flooded with, “How do you stop this thing?” (as it was with Riak), we’re going to make things friendlier.

So, why don’t you have it yet? The not-so-easy part.

You’ve probably asked yourself, “What of integration of Riak and Riak Search?” We have many notes from discussions about how it could or should be done, as well as code showing how it can be done. But, we’re not completely satisfied with any of our implementations so far.

There are certainly no shortage of designs and ideas on how this could or should work, so we’re going to make a final pass at refining all of our ideas, given our current Riak Search system to play with, so that we can provide a solid, extensible system, instead of one that with many rough edges that would almost certainly be replaced immediately.

Furthering this sentiment is that we think that our existing map/reduce framework and the functionality and features provided by Riak Search are a true power combo when used together intelligently, than simply as alternatives, or at worse, at odds. As a result, we’re defining exactly how Riak Search indexing and querying should be threaded into Riak map/reduce processing to bring you a combination that is undoubtedly more than the sum of its parts.

We could tease you with specifics, like generating the set of bucket/key inputs to a map phase by performing a Riak Search query, or parameterizing Search phases with map results; though, for now, amidst protest both internally — we’re chomping at the bit to get this out into the world and into your hands  - and externally, as our favorite people continually request this exact set of technology and features, we’re going to implement the few extra details from our refined notes before forcing it on you all.

Hold on just a little longer. :)

-the Riak Search Team

Webmachine in the Data Center

May 19, 2010

While Riak is Basho’s most-heavily developed and widely distributed piece of open source software, we also hack on a whole host of other projects that are components of Riak but also have myriad standalone uses. Webmachine is one of those projects.

We recently heard from Caleb Tennis, a member of The Data Cave team who was tasked with building out and simplifying operations in their 80,000 square foot data center. Caleb filled us in on how he and his team are using Webmachine in their day to day operations to iron out the complexities that come with running such a massive facility. After some urging on our part, he was gracious enough to put together this illustrative blog post.

Enjoy,

Mark

Community Manager

A Data Center from the Ground Up

Building a new data center from the ground up is a daunting task. While most of us are familiar with the intricacies of the end product (servers, networking gear, and cabling), there’s a whole backside supporting infrastructure that also must be carefully thought out, planned, and maintained. Needless to say, the facilities side of a data center can be extremely complex.

Having built and maintained complex facilities before, I already had both experience and a library of software in my tool belt that I had written to help manage the infrastructure of these facilities. However, I recognized that if I was to use the legacy software, some of which was over 10 years old, it would require considerable work to fit it to my current needs. And, during that period, many other software projects and methodologies had matured to a state that it made sense to at least consider a completely different approach.

The main crux of such a project is that it involves communications with many different pieces of equipment throughout the facility, each of which has its own protocols and specifications for communication. Thus, the overall goal of this project is to abstract the communications behind the scenes and present a consistent and clear interface to the user so that the entire process is easier.

Take for example the act of turning on a pump. There are a number of pumps located throughout the facility that need to be turned on and off dynamically. To the end user, a simple “on/off” style control is what they are looking for. However, the actual process of turning that pump on is more complicated. The manufacturer for the pump controller has a specific way of receiving commands. Sometimes this is a proprietary serial protocol, but other times this uses open standard protocols like Modbus, Fieldnet, or Devicenet.

In the past, we had achieved this goal using a combination of open source libraries, commercial software, and in-house software. (Think along the lines of something like Facebook’s Thrift, where you define the interface and let the backend implementation be handled behind the scenes;in our case, the majority of the backend was written in C++.)

“This is what led us to examine Erlang…”

But as we were looking at the re-implementation of these ideas for our data center, we took a moment to re-examine them. The implementation we had, for the most part, was stateless, meaning that as systems would GET and SET information throughout the facility, they did so without prior knowledge and without attempting to cache the state of any of the infrastructure. This is a good thing, conceptually, but is also difficult in that congestion on the communication networks can occur if too many things need access to the same data frequently. It also suffered from the same flaws as many other projects: it was big and monolithic; changes to the API were not always easy; and, most of all, upgrading the code meant stops and restarts, so upgrading was done infrequently. As we thought about the same type of implementation in our data center, it became clear that stops and restarts in general were not acceptable at all.

This is what led us to examine Erlang, with its promise of hot code upgrades, distributed goodness, and fault -tolerance. In addition, I had been wanting to learn Erlang for a while now but never really had an excuse to sit down and focus on it. This was my excuse.

While thinking about how this type of system would be implemented in Erlang, I began by writing a Modbus driver for Erlang, as a large portion of the equipment we interact with uses Modbus as part of its communications protocol. I published the fruits of these labors to GitHub (http://github.com/ctennis/erlang-modbus), in the hopes that it might inspire others to follow the same path. The library itself is a little rough (it was my first Erlang project) but it served as the catalyst for thinking about not only how to design this system in Erlang, but also how to write Erlang code in general.

Finding Webmachine

While working on this library, I kept thinking about the overall stateless design, and thought that perhaps a RESTful interface may be appropriate. Using REST (and HTTP) as the way to interface with the backend would simplify the frontend design greatly, as there are myriad tools already available for client side REST handling. This would eliminate the need to write a complicated API and have a complicated client interface for it. This is also when I found Webmachine.

Of course there are a number of different ways this implementation could have been achieved, Erlang or not. But the initial appeal of Webmachine was that it used all of the baked in aspects of HTTP, like the error and status codes, and made it easy to use URLs to disseminate data in an application. It is also very lightweight, and the dispatching is easy to configure.

Like all code, the end result was the product of several iterations and design changes, and may still be refactored or rewritten as we learn more about how we use the code and how it fits into the overall infrastructure picture.

Webmachine in Action

Let’s look at how we ultimately ended up using Webmachine to communicate with devices in our data center…

For the devices in the facility that communicate via modbus, we created a modbus_register_resource in Webmachine that handles that interfacing. For our chilled water pumps (First Floor, West, or “1w”), the URL dispatching looks like this:

{["cw_pump","1w",parameter],modbus_register_resource,[cw_pump, {cw_pump_1w, tcp, "10.1.31.202", 502, 1}]}.

This correlates to the url: http://address:8000/cw_pump/1w/PARAMETE

So we can formulate URIs something like this:

http://address:8000/cw_pump/1w/motor_speed or http://address:8000/cw_pump/1w/is_active

And by virtue of the fact that our content type is text:

content_types_provided(RD,Ctx) ->
{[{"text/plain", to_text}], RD, Ctx}.

We use HTTP GETs to retrieve the desired result, as text. The process is diagrammed below:

This is what is looks like from the command line:

user@host:~# curl http://localhost:8000/cw_pump/1w/motor_speed

47.5

It even goes further. If the underlying piece of equipment is not available (maybe it’s powered off), we use Webmachine to send back HTTP error codes to the requesting client. This whole process is much easier than writing it in C++, compiling, distributing, and doing all of the requisite exception handling, especially across the network. Essentially, what had been developed and refined over the past 10 years as our in-house communications system was basically redone from scratch with Erlang and Webmachine in a matter of weeks.

For updating or changing values, we use HTTP POSTs to change values that are writable. For example, we can change the motor speed like this:

user@host:~# curl -X POST http://localhost:8000/cw_pump/1w/motor_speed?value=50

user@host:~# curl http://localhost:8000/cw_pump/1w/motor_speed

50.0

But we don’t stop here. While using Webmachine to communicate directly with devices is nice, there also needs to exist an infrastructure that is more user friendly and accessible for viewing and changing these values. In the past, we did this with client software, written in Windows, that communicated with the backend processes and presented a pretty overview of what was happening. Aside from the issue of having to write the software in Windows, we also had to maintain multiple copies of it actively running as a failover in case one of them went down. Additionally, we had to support people running the software remotely, from home for example, via a VPN, but still had to be able to communicate with all of the backend systems. We felt a different approach was needed.

What was this new approach? Webmachine is its own integrated webserver, so this gave us just about everything we needed to host the client side of the software within the same infrastructure. By integrating jQuery and jQTouch into some static webpages, we built an entire web based control system directly within Webmachine, making it completely controllable via mobile phone. Here is a screenshot:

What’s Next?

Our program is still a work in progress, as we are still learning all about the various ways the infrastructure works as well as how we can best interact with it from both Webmachine and the user perspective. We are very happy with the progress made thus far, and feel quite confident about the capabilities that we will be able to achieve with Webmachine as the front end to our very complex infrastructure.

Caleb

Caleb Tennis is President of The Data Cave, a full service, Tier IV compliant data center located in Columbus, Indiana.

Planning for Eventual Consistency

May 14, 2010

You may remember that last week, we recorded a podcast with Benjamin Black all about the immense variety of databases in the NoSQL space and what your criteria should be when choosing one.

If you listened carefully, you may also remember that Benjamin and Justin Sheehy started to discuss eventual consistency. We decided to roll that into its own podcast as we thought it was a topic worthy of its own episode.

Think there are only a certain subset of databases that are “eventually consistent”? Think again. Regardless of the database you choose, eventual consistency is something you should embrace and plan for, not fear.

Listen, Learn, and Enjoy – -

Mark


If you are having problems getting the podcast to play, click here to play in new window or right click to download the podcast.

NoSQL and Endless Variety

A Conversation about Variety and Making Choices

Benjamin Black stopped by the Basho offices recently and we had the chance to sit him down and discuss the collection of things that is “NoSQL.”

In this, the fifth installment of the Basho Riak Podcast, Benjamin and Basho’s CTO Justin Sheehy discuss the factors that they think should play the largest part in your evaluation of any database, NoSQL or otherwise. (Hint: it’s not name recognition.)

Highlights include why and when you might be best served improving your relational database architecture and when it might be better to use a NoSQL system like Cassandra or Riak to solve part of your problem, as well as why you probably don’t want to figure out which one of the NoSQL systems solves all of your problems.

Enjoy,

Mark



If you are having problems getting the podcast to play, click here to play in new window or right click to download the podcast.