July 3, 2013
Basho CTO, Justin Sheehy, recently participated in a “Not Only SQL Summit,” alongside executives from some of the top NoSQL vendors. This summit was moderated by Ted Neward of Neward & Associates LLC and discussed the evolution of NoSQL systems as well as some associated best practices. It also included insights from customers currently using these NoSQL solutions.
In addition to Justin Sheehy, panelists included:
- Anthony Molinaro, Infrastructure Architect at OpenX, discussing how they use Riak
- Patrick McFadin, Principal Solution Architect at DataStax
- Michael Kjellman, Software Engineer at Barracuda Networks, discussing how they use Cassandra
- Justin Weiler, CTO at FatCloud
- Attinder Khalsa, Executive Software Architect at Wilshire Axon, discussing how they use FatDB
Throughout this summit, OpenX, Barracuda Networks, and Wilshire Axon discussed not only why they chose to move away from relational systems but also why they chose the NoSQL vendor that they did. They also talk about their experiences dealing with eventual consistency and schemaless data. You can view the full summit below:
April 17, 2013
This post looks at five commonly asked questions about Riak. For more questions and answers, check out our Riak FAQ.
What hardware should I use with Riak?
Riak is designed to be run on commodity hardware and is run in production on a variety of different server types on both private and public infrastructure. However, there are several key considerations when choosing the right infrastructure for your Riak deployment.
RAM is one of the most important factors – RAM availability directly affects what Riak backend you should use (see question below), and is also required for complex MapReduce queries. In terms of disk space, Riak automatically replicates data according to a configurable n_val. A bucket-level property that defaults to 3, n_val determines how many copies of each object will be stored, and provides the inherent redundancy underlying Riak’s fault-tolerance and high availability. Your hardware choice should take into consideration how many objects you plan to store and the replication factor, however, Riak is designed for horizontal scale and lets you easily add capacity by joining additional nodes to your cluster. Additional factors that might affect choice of hardware include IO capacity, especially for heavy write loads, and intra-cluster bandwidth. For additional factors in capacity planning, check out our documentation on cluster capacity planning.
Riak is explicitly supported on several cloud infrastructure providers. Basho provides free Riak AMIs for use on AWS. We recommend using large, extra large, and cluster compute instance types on Amazon EC2 for optimal performance. Learn more in our documentation on performance tuning for AWS. Engine Yard provides hosted Riak solutions, and we also offer virtual machine images for the Microsoft VM Depot.
What backend is best for my application?
Riak offers several different storage backends to support use cases with different operational profiles. Bitcask and LevelDB are the most commonly used backends.
Bitcask was developed in-house at Basho to offer extremely fast read/write performance and high throughput. Bitcask is the default storage engine for Riak and ships with it. Bitcask uses an in-memory hash-table of all keys you write to Riak, which points directly to the on-disk location of the value. The direct lookup from memory means Bitcask never uses more than one disk seek to read data. Writes are also very fast with Bitcask’s write-once, append-only design. Bitcask also offers benefits like easier backups and fast crash recovery. The inherent limitation is that your system must have enough memory to contain your entire keyspace, with room for a few other operational components. However, unless you have an extremely large number of keys, Bitcask fits many datasets. Visit our documentation for more details on Bitcask, and use the Bitcask Capacity Calculator to assist you with sizing your cluster.
LevelDB is an open-source, on-disk key-value store from Google. Basho maintains a version of LevelDB tuned specifically for Riak. LevelDB doesn’t have Bitcask’s memory constraints around keyspace size, and thus is ideal for deployments with a very large number of keys. In addition to this advantage, LevelDB uses Google Snappy data compression, which provides particular efficiency for text data like raw text, Base64, JSON, HTML, etc. To use LevelDB with Riak, you must the change the storage backend variable in the app.config file. You can find more details on LevelDB here.
Riak also offers a Memory storage backend that does not persist data and is used simply for testing or small amounts of transient state. You can also run multiple backends within a single Riak instance, which is useful if you want to use different backends for different Riak buckets or use a different storage configuration for some buckets. For in-depth information on Riak’s storage backends, see our documentation on choosing a backend.
How do I model data using Riak’s key/value design?
Riak uses a key/value design to store data. Key/value pairs comprise objects, which are stored in buckets. Buckets are flat namespaces with some configurable properties, such as the replication factor. One frequent question we get is how to build applications using the key/value scheme. The unique needs of your application should be taken into account when structuring it, but here are some common approaches to typical use cases. Note that Riak is content-agnostic, so values can be any content type.
|Session||User/Session ID||Session Data|
|Content||Title, Integer||Document, Image, Post, Video, Text, JSON/HTML, etc.|
|Advertising||Campaign ID||Ad Content|
|Sensor||Date, Date/Time||Sensor Updates|
|User Data||Login, Email, UUID||User Attributes|
For more comprehensive information on building applications with Riak’s key/value design, view the use cases section of our documentation.
What other options, besides strict key/value access, are there for querying Riak?
Most operations done with Riak will be reading and writing key/value pairs to Riak. However, Riak exposes several other features for searching and accessing data: MapReduce, full-text search, and secondary indexing.
Riak also provides Riak Search, a full-text search engine that indexes documents on write and provides an easy, robust query language and SOLR-like API. Riak Search is ideal for indexing content like posts, user bios, articles, and other documents, as well as indexing JSON data. For more information, see the documentation on Riak Search.
Secondary indexing allows you to tag objects in Riak with one or more queryable values. These “tags” can then be queried by exact or range value for integers and strings. Secondary indexing is great for simple tagging and searching Riak objects for additional attributes. Check out more details here.
How does Riak differ from other databases?
We often get asked how Riak is different from other databases and other technologies. While an in-depth analysis is outside the scope of this post, the below should point you in the right direction.
Riak is often used by applications and companies with a primary background in relational databases, such as MySQL. Most people who move from a relational database to Riak cite a few reasons. For one, Riak’s masterless, fault-tolerant, read/write available design make it a better fit for data that must be highly available and resilient to failure scenarios. Second, Riak’s operational profile and use of consistent hashing means data is automatically redistributed as you add machines, avoiding hot spots in the database and manual resharding efforts. Riak is also chosen over relational databases for the multi-datacenter capabilities provided in Riak Enterprise. A more detailed look at the difference between Riak and traditional databases and how to make the switch can be found in this whitepaper, From Relational to Riak.
A more detailed look at the technical differences between Riak and other NoSQL databases can be found in the comparisons section of our documentation, which covers databases such as MongoDB, Couchbase, Neo4j, Cassandra, and others.
February 24, 2013
Recently, Basho engineer, Eric Redmond, published “A Little Riak Book.” This book is available free for download at littleriakbook.com and provides a great overview of Riak, including how to think about a distributed system compared to more traditional databases.
The book starts with a discussion on concepts. Since Riak is a distributed NoSQL database, it requires developers to approach problems differently than they would with a relational database. The concepts section describes the differences between various NoSQL systems, takes an in-depth look at Riak’s key/value data model, and describes how Riak is designed for high availability (as well as how it handles eventual consistency constraints). After laying the theoretical groundwork, the book walks developers through how to use Riak by explaining the different querying options and showing them how to tinker with settings to meet different use case needs. Finally, it covers the basic details that operators should know, such as how to set up a Riak cluster, configure values, use optional tools, and more.
After finishing the book, start playing around with Riak to see if it’s the right fit for your needs. You can download Riak on our Docs Page.
August 10, 2012
We have a poorly defined term in our industry: “NoSQL.” [Does your toaster run SQL? No? Then you own a NoSQL toaster.] Be that as it may, Riak falls under the umbrella of software that carries this label. In our attempt to own the label, we reinterpret it to mean that we now have more choices as developers. For too long, our only meaningful options for data storage were SQL relational databases and the file system.
In the past few years, that has changed. We now have many production-ready tools available for storing and retrieving data, and many of those fall within the sphere of NoSQL. With all of these new options, how do we as developers choose which database to use?
On the Professional Services Team, this is the first question we ask ourselves: What is the best storage option for this application? At Basho, Professional Services goes on-site to assist clients with training, application development, operational planning – anything to help get the most out of Riak. In order to know how to do that, we also have to know quite a bit about other NoSQL databases and storage options, and when it might be a better option to go with something other than Riak. Below we outline some of our reasoning when we evaluate Riak for our clients and their applications.
A Simple Key-Value Store
When our clients simply need a key-value store, our job as consultants couldn’t get any easier. Riak is a great key-value database with an excellent performance profile, fantastic high availability and scaling properties, and the best deployment/operations story that we know. We are very proud of our place in the industry when it comes to these features.
But when the business logic for the application requires an access pattern more sophisticated than a simple key lookup, we have to dig deeper to figure out whether Riak is the right tool for the job. We have evolved the following distinguishing criteria:
If there is a usage scenario requiring ad-hoc, dynamic querying, then we might consider alternative solutions.
- Ad-hoc: by this we mean that queries run at unpredictable times, possibly triggered by end-users of the application.
- Dynamic: by this we mean that queries are constructed at the time they are being run.
If the usage scenario requires neither ad-hoc nor dynamic queries, then we can usually construct the application in such a way that even complex analysis works well with Riak’s key-value nature. If the scenario requires ad-hoc but not dynamic queries, then we have look at options to tune performance of the known access patterns. If the scenario requires dynamic queries run on a regular basis, then we might investigate running the dynamic queries on an ‘offline’ cluster replica so that we don’t interfere with the availability of the ‘online’ production clusters.
These criteria began to take form in our evaluations of Riak for data analytics. We often see Riak deployed as a Big Data solution because of its exceptional fault-tolerance and scaling properties, and running analytics on Big Data is a common use case. MapReduce gives us the ability to run sophisticated analytics on Riak, but other solutions exist that are optimized for analytics in ways that Riak is not. It is generally not a good idea to run MapReduce on a production Riak cluster for data analysis purposes. MapReduce exists in Riak primarily for data maintenance, data migrations, or offline analysis of a cluster replicate. All three of these are good use cases for Riak’s MapReduce implementation.
Key-Value State of Mind
Does that mean that data analysis applications are off the table? Absolutely not! In our training sessions and workshops, we emphasize that key-value databases requite a different mindset than relational databases when you are planning your application.
In traditional SQL applications, we as engineers start defining the data model, normalizing the data, and structuring models in such a way that relations can be fetched efficiently with appropriate indexing. If we do a good job modeling the data, then we can proceed with reasonable certainty that the application built on top if it will unfold naturally. The developers of the application layer will take advantage of well-known patterns and practices to construct their queries and get what they want out of the data model. It’s no surprise that SQL is pretty good for this kind of thing.
In a key-value store, we approach the software architecture from the opposite side and proceed in the other direction. Instead of asking what the data model should look like and working up to the application view, we begin by asking what the resulting view will look like and then work ‘backwards’ to define the data model. We start with the question: What do you want the data to look like when you fetch it from the database?
If we can answer the above question, and if we can define the structure of the result that we want in advance, then we probably have a good case for pre-processing the results. We pre-process the data in the application layer before it enters Riak, and then we just save the answer that we want as the value of a new key-value pair. In these cases, we can often get better performance when fetching the result than a relational approach because we don’t have to perform
the computation of compiling and executing the SQL query.
A rolling average is a simple example: Imagine that we want to have the average of some value within data objects that get added to the system throughout the day. In a SQL database, we can just call
average() on that column, and it will compute the answer at query time. In a key-value store, we can add logic in the application layer to catch the object before it enters Riak, fetch the average value and number of included elements from Riak, compute the new rolling average, and save that answer back in Riak. The logic in the application layer is now slightly more complicated, but we weigh this trade-off against the simplicity of administering the key-value database instead of a relational one. Now, when you go to fetch the average, it doesn’t have to compute it for you. It just returns the answer.
With the right approach, we can build applications in such a way that they work well with a key-value database and preserve the highly available, horizontally scaling, fault-tolerant, easy-as-pie administration that we have worked so hard to provide in Riak. We look forward to continuing to help you get the most out of Riak, and choosing the best tool for the job.
See Sean’s excellent post on Schema Design in Riak
: In some situations, using MapReduce to facilitate a bulk fetch provides better performance than requesting each object individually because of the connection overhead. If you go that route, be sure to use the native Erlang MapReduce functions like ‘reduce_identity’ already available in Riak. As always, test your solution before putting it into production.
May 11, 2011
I’m on a plane to Goto Copenhagen from our electric post-Kill-Dash-Nine Board meeting in Washington, DC and, afterwards, an intense client meeting. I went to watch Pete Sheldon, our new Director of Sales, and Justin Sheehy at work. I finally had a chance to sit and study a proposal for the Basho product roadmap for the next year. This roadmap is both breathtakingly ambitious and oddly incremental, quotidian even.
In the next year we will solve problems endemic to distributed systems – groundbreaking work of the sort careers are surely made — and yet at the same time, these problems seem incremental and iterative; part of an ongoing process of small improvements. They seem both astounding and inevitable.
This led me to an interesting insight — doing this is not easy.
What we are doing is like digging a canal through bedrock. We are hacking away at hard problems — problems others encountered and, either died trying or, mopping their brows with their handkerchiefs, threw down their shovels and went shopping. A lot of cool companies are hacking away, too, so it is not like we are alone, but the honorable and the diligent are not what this post is about.
This post is about the ugly truth I have to call out.
To put it bluntly, if you are claiming the architectural challenges presented by applications with high write loads spread across multiple data centers are easy, you are lying. You do not, as Theo Schlossnagle remarked recently to us, “respect the problem.” You must respect the problem or you disrespect the necessary tradeoffs. And if you disrespect the tradeoffs, you disrespect your user. And if you disrespect your user, you are, inevitably, a liar. You say _this_ is easy. You promise free lunches. You guarantee things that turn out to be impossible. You lie.
What our technology generation is attempting is really hard. There is no easy button. You can’t play fast and loose with the laws of physics or hand-wave around critical durability issues. You can sell this stuff to your venture capitalist, but we’re not buying it.
Immutable laws are not marketing. And therefore, marketing can’t release you from the bonds of immutable laws. You can’t solve the intractable problems of distributed systems so eloquently summarized with three letters – C-A-P – by Google’s cloud architect (and Basho Board member) Dr. Eric Brewer (a man both lauded and whose full impact on our world has not yet been reckoned), with specious claims about downloads and full consistency.
- Memory is not storage.
- Trading the RDBMS world for uptime is hard. There are no half-steps. No transitional phases.
- The geometry of a spinning disk matters for your app. You can’t escape this.
- Your hardware RAID controller is not perfect. It screws things up and needs to be debugged.
- Replication between two data centers is hard, let alone replication between three or 15 data centers.
- Easily adding nodes to a cluster under load impacts performance for a period determined by the amount of data stored on the existing nodes and the load on the system…and the kind of servers you are using…and a dozen other things. It looks easy in the beginning.
These are all sensible limitations. Like the speed of light or the poor quality of network television, these are universal constants. The point is: tradeoffs can’t be solved by marketing.
To be sure, there are faster databases than Riak. But do they ship with default settings optimized for speed or optimized for safety? We *ache* to be faster. We push ourselves to be faster. We tune and optimize and push. But we will never cross the line to lose data. While it is always tempting to set our defaults to *fast* instead of *safe*, we won’t do it. We will sacrifice speed to protect your data. In fact, if you prefer speed to preserving data, *don’t use Riak*. We tell the truth even if it means losing users. We will not lie.
Which is why others who do it make me ball my fists, score my palms, and look for a heavy bag to punch. Lying about what you can do – and spreading lies about other approaches – is a blatant attempt to replace the sacrifice of hard-core engineering and ops with fear, uncertainty, and doubt – FUD.
People who claim they are “winning NoSQL” with FUD are damaging our collective chance to effect a long-overdue change to the way data is stored and distributed. This opportunity is nothing short of a quantum shift in the the quality of your life if you are in development, operations, or are a founder who lives and dies by the minute-to-minute performance of your brainchild/application.
The FUD-spreaders are destroying this opportunity with their lies. They are polluting the well by focusing on false marketing – on being the loud idiot drunk – instead of solving the problem. They can screw this up with their failure. It is time for us to demand they drop the FUD – drop the “F” bomb – and stop lying about what they can do. Just tell the truth, like Basho does — admit this is a hard problem and that hardcore engineering is the answer. In fact, they should do the honorable thing and quit the field if they are not ready to invest in the work needed to solve this problem.
If we, collectively, the developer and sysadmin community running the infrastructure of the world economy, allow people to replace engineering with marketing lies, to trade coffee mugs for countless hours of debugging, and in doing so, to destroy the reputation of a new class of data storage systems before they have had a chance to gain a foothold in the technology world, we all lose.
There are many reasons why the FUD spreaders persist.
There are the smart folks who throw our hands up and cynically say that liars are by their nature better marketers. But marketing need not be lies, cynically accepted.
Then there are some of us who are too busy keeping projects or businesses afloat to really dig into the facts. But we sense that we are being lied to, and so we detach, saying this is all FUD. This can’t help us. Tragically, we miss the opportunity to make a big change
Most of us simply want to trust other developers and will believe claims that seem too good to be true. If we do this, we are in a small but serious way saying that our hard-won operational wisdom is meaningless, that anyone who has deployed a production application or contributed to an open-source project has no standing to challenge the loud-mouth making claims that up-time is easy.
Up-time is not easy. Sleeping through the night without something failing is a blessing. Do not – *do not* – let VCs and marketers mess up our opportunity to take weekends off and sleep through the night when we are on call. The database technologies of 1980 (and their modern apologists in NoSQL) should not shape the lives of technologists in 2011.
In the briefest terms, Basho won’t betray this revolution because we keep learning big lessons from our small mistakes. We are our harshest critics.
We will deliver a series of releases that allow you to tune for the entire spectrum of CAP tradeoffs – strong consistency to strong partition tolerance – while making clear the tradeoffs and costs. At the same time Riak will provide plugins for Memcache, secondary indices, and also a significant departure from existing concepts of MapReduce that allows for flexible, simple, yet massively distributed computation, and much more user-friendly error reporting and logging. (Everyone reading this understands why that last item merits inclusion on any list of great aspirations – usability can damn or drive a project.)
We will deliver these major innovations, improvements, and enhancements, and they will be hard for us and our community to build. And it will take time for us to explain it to people. And you will find bugs. And it will work better a year after it ships than it does on day one.
But we will never lie to you.
We call on others to please drop the FUD, to acknowledge the truth about the challenges we all face running today’s infrastructure, and to join us in changing the world for the better.
December 2, 2010
In the last two weeks, Basho has been fortunate to sign up some pretty cool clients. Considering we are a young company, that a database is among the stickiest pieces of software and therefore decisions to deploy something new are undertaken with caution, and that we have spent approximately $7,000 on marketing (mostly on sponsorship of a single event), the fact we are getting ten leads a week and converting leads to customers seems pretty amazing.
While this obviously puts the lie to the idea that the market for NoSQL is too early to build a business on, one thing is certain: what people want from NoSQL varies from significantly from client to client.
Some want high availability (especially write-availability) and scalability. Some want distributed analytical capabilities and low latency on queries of big data sets. Some want both. All of the people we are talking to have specific applications in mind and all of them are interested in using NoSQL to do something they really could not do before.
This is the proverbial “greenfield” for NoSQL. Not verticals (and especially not social networking, which is over-represented in examples because two of the great early NoSQL data stores were developed by Facebook and LinkedIn), but pent up demand is where we see growth and opportunity.
Some investors and product types worry this means there is no specific niche NoSQL fills, meaning the market is small and making it hard for small companies to thrive. While I happen to agree with the premise (there is no specific niche), I view that as an indicator of the potentially massive size of the opportunity. We are seeing pent up demand from companies that want to build web applications that are more reliable, scale better, use distributed map/reduce and indexing features, and run in data centers across continents.
No niche there.
November 11, 2010
Things are moving incredibly fast in the NoSQL space. I am used to internet-fast — helping bring on 300 customers in a year at Akamai; going from adult bulletin boards and leased lines to hosting sites for twenty percent of the Fortune 500 at Digex (Verizon Business) in eighteen months. I have never seen a space explode like the NoSQL space.
Two weeks ago, Justin Sheehy stood on stage delivering a rousing and thoughtful presentation to the NoSQL East Conference that was less about Riak and more about a definition of first principles that underpinned Riak: what it REALLY means when you claim such terms as scalability (it doesn’t mean buying a bigger machine for your master DB) and fault-tolerance (it has to apply to writes and reads and is binary; you either always accept writes and serve reads or you don’t). The conference was a bit of a coming out party for Basho, which co-sponsored the event with Rackspace, Georgia Tech, and a host of other companies. We had been working on Riak for 18 months or so in relative quiet and it was nice to finally see what people thought, first hand.
There were equally interesting presentations about Pig and MongoDB and a host of other NoSQL entrants, all of which will make for engrossing viewing when they finally get posted. We were told this wasn’t quite as exciting as the NoSQL conference out West but none of us seemed to mind. Home Depot, Turner Broadcasting, Weather.com, and Comcast had all sent folks down to evaluate the technology for real, live problems and the enthusiasm in the auditorium spilled out into the Atlanta bars. Business cards were exchanged, calls set up, even a little business discussed. Clearly, NoSQL databases were maturing fast.
No sooner had we returned to Cambridge than news of Flybridge’s investment in 10Gen came out. Hooray! Someone was willing to bet a $3.4 million dollars on a company in the space. Chip Hazard, ever affable, wrote a nice blog post explaining the investment. According to him, every developer they talked to had downloaded some NoSQL database to test. Brilliant news. He said Flybridge invested in 10Gen because they liked the space and knew the team from their investment in Doubleclick, from whose loins the management team at 10Gen issued. No more felicitous reason exists for a group of persons to invest $3.4 million than that previous investments in the same team were handsomely rewarded. I would wish Chip and 10Gen the best if I had time.
Because contemporaneous with the news of Flybridge’s investment, and almost as if the world had decided NoSQL’s time had come, we began to field emails and calls from interested parties. Trials, quotes, lengthy discussions about features and uses of Riak — the week was a blur. Everyone was conducting a bakeoff: “I have a 4TB database and customers in three continents. I am evaluating Riak and two other document datastores. Tell me about your OLAP features.”
Heady times and, frankly, of somewhat dubious promise, if you ask me. Potential clients that materialize so quickly always seem to disappear just as fast. Really embracing a new technology requires trials, tests, new features, and time. Time most off all. These “bluebirds” would fly away in no time, if my experience held true.
Except, this time it didn’t happen. Contracts were exchanged. Pen nibs were sharpened. It is as if the entire world decided to not wait for the everyone else to jump on the bandwagon and instead, decided to go NoSQL. Even using this last week as the sole example, I think the reason is plain — people have real pain and suddenly the word is out that they no longer have to suffer.
Devs are constrained by what they can build, rich features notwithstanding. Ask the company that had to choose between Riak and a $100K in-memory appliance to scale. And Ops is getting slaughtered — the cost of scaling poorly (and by poorly I mean pagers going off during dinner, bulk updates taking hours and failing all the time, fragmented and unmanageable indices consuming dozens of machines) is beginning to look like the cost of antiquated technology. Good Ops people are not fools. They look for ways to make life easier. Make no mistake — all the Devs and Ops folks came with a set of tough questions and a list of new features. They also came with an understanding that companies that release open source software still have a business to run. They are willing to spend on a real company. In fact, having a business behind Riak ended up mattering as much as any features.
So, I suspect, we are at the proverbial “end of the beginning.” Smart people in the NoSQL movement have succeeded in building convincingly good software and then explaining the virtues convincingly (all but one of the presentations at NoSQL East demonstrated the virtues of the respective approaches). Now these people are connecting to smart people responsible for building and running web apps, people who are decidedly unwilling to sit around hoping for Oracle or IBM to solve their problems.
In the new phase — which we will cleverly call the “beginning of the middle” — great tech will matter even more than it does now. It won’t be about selling or marketing or any of that. If our numbers are any indication of a larger trend, more people will download and install NoSQL databases in the next month than the combined total of the three months previous. More people in a buying frame of mind will evaluate NoSQL technology not in terms of its coolness but in terms of its ability to solve their real, often expensive problems. The next phase will be rigorous in a way this phase was not. People have created several entirely new ways to store and distribute data. That was the easy part.
Just as much as great tech, the people behind it will matter. That means more calls between us and Dev teams. That means more feature requests considered and, possibly, judiciously, agreed to.
That also means lots of questions answered. People care about support. They care about whether you answer their emails in a timely fashion and are polite. People want to do business with NoSQL. They want to spend money to solve problems. They need to know they are spending it with responsible, responsive, dedicated people.
Earl tweets about it all the time and I happen to agree: any NoSQL success helps all NoSQL players. I also happen to feel that any failure hurts all NoSQL players. As NoSQL rapidly ages into its adolescence, it will either be awkward and painful or exciting and characterized by incredible growth.
When I was a kid on the Navy base in Alameda, my babysitter watched soaps all afternoon, leaving me mostly to my own devices. If I stopped in, I always got roped in to hearing her explain her favorite stories. Most of all she loved how ridiculous they were, though she would never admit this exactly. Instead, adopting an attitude of gleeful incredulity, she would point out this or that attractive young actor and tell me how just a year ago, she was a little baby. “Soap people have to grow up quick, I guess,” was her single (and to her, completely satisfactory) explanation. “If they don’t, they get written out of the story.”
October 26, 2010
Basho is hosting one event this week and participating in another. Here are the details to make sure everyone is up to speed:
A NOSQL Evening in Palo Alto
Tonight there will be a special edition of the Silicon Valley NoSQL Meetup, billed as “A NOSQL Evening in Palo Alto.” Why do I say “special”? Because this month’s event has been organized by the one and only Tim Anglade as part of his NoSQL World Tour. And this is shaping up to be one of the tour’s banner events.
There are almost 200 people signed up to see this discussion as it’s sure to be action-packed and informative. If you’re in the area and can make it out on short notice, I would recommend you attend.
October San Francisco Riak Meetup
On Thursday night, from 7-9, we are holding the October installment of the San Francisco Riak Meetup. Like last month, the awesome team at Engine Yard has once again been gracious enough to offer us their space for the event.
We have two great planned talks for this month. The first will be Basho hacker Kevin Smith talking about a feature of Riak that he has had a major hand in writing: MapReduce. Kevin is planning to cover everything from design to new code demos to the road map. In short, this should be exceptional.
For the second half of Thurday’s meetup we are going to get more interactive than usual. Articulation of use cases and database applicability is still something largely unaddressed in our space. So we thought we would address it. We are inviting people to submit use cases in advance of the meetup with some specific information about their apps. The Basho Developers are going to do some work before the event analyzing the use cases and then, with some help from the crowd, determine if and how Riak will work for a given use case – and if Riak isn’t the right fit, we might even help you find one that is. If you are curious whether or not Riak is the right database for that Facebook-killer you’re planning to build, now is your chance to find out. We still have room for one or two more use cases, so even if you’re not going to be able to attend the Thursday’s meetup I want to hear from you. Follow the instructions on the meetup page linked above to submit a use case.
That said, if you are in the Bay Area on Thursday night and want to have some beer and pizza with a few developers who are passionate about Riak and distributed systems, RSVP for the event. You won’t be disappointed.
Hope to see you there!
September 9, 2010
At long last we have all the details ironed out for the upcoming September Riak Meetup in San Francisco. The crew here in SF is quite excited about this month’s event, and here’s why:
Date: Thursday, Sept. 23rd
Location: Engine Yard Offices, located at 500 Third Street, Suite 510
- 7:15 – Riak Basics
After the first meetup, one of the attendees remarked, “Good, but looking for some basics and some hands on demo as well.” Admittedly, this is something we could have addressed a bit better. So at the beginning of this meetup (as well as all meetups moving forward) we are going to devote at least 15 minutes to discuss Riak basics. There are no stupid questions. Ask away.
- 7:30 – Riak vs Git: NOSQL Battle Royale
Presenter: Rick Olson, Github
This talk will compare and contrast Riak and Git on their merits as key/value stores, and look at how the two can work together.
- 8:00 – From Riak to RabbitMQ
Presenter: Andy Gross, Basho Technologies
This will cover using Riak to publish to RabbitMQ using post-commit hooks and gen_bunny.
- 8:30 – General Riak/Distributed Systems Conversation and Networking
Note: There is only seating for 50, so you’ll want to get there on time to secure a seat.
Basho will be providing food (pizza) and refreshments (beer, soda, etc.). And for those of you who can’t join us next Thursday, I will also be filming the talks with the goal of posting them online if everything goes to plan.
You can RSVP on the Riak Meetup Page. So go do it. Now!
Hope to see you there.
June 9, 2010
Day 2 – June 8, 2010
Keynote – Pieter Hintjens
Pieter (iMatix) started his talk with a series of high-level questions, developer-to-developer, intended to focus the audience on the fact that multi-core processing across multiple computers is the new norm, and (most) programming tools haven’t yet evolved to meet the challenge.
He then identified and discussed some of the natural patterns in software development that make things simpler. After a few examples relating to the NoSQL world, he identified three that led into his introduction of 0MQ (Pronounced Zero-M-Q):
- Asynchronous processing is a natural pattern, single threads that read from a queue, do work, and write to another queue.
- Choosing reliability over persistence is a natural pattern. If you don’t crash, you don’t need to worry about persisting data. (Not sure how I feel about this one. Just gonna roll with it for the sake of the talk.)
- Being agnostic to lines of communication is a natural pattern, what you send is orthogonal to how you send it.
Pieter then introduced the 0MQ library, which attempts to be a simple and lightweight message queue following these natural patterns. It takes care of defining queue endpoints, connecting (and re-connecting) the endpoints, buffering messages in memory, and not much else. The data format is simply a length and a binary blob, that’s all.
According to Pieter, 0MQ should be thought of as a protocol, just like TCP or UDP. In other words, 0MQ is the sort of thing you embed in your database application.
With 0MQ, you can safely create multi-threaded applications that safely leverage multiple cores by making each worker process single threaded, and have it read from a queue, perform some unit of work, and write to another queue. (In other words, the Actor model, concurrency by message passing.) The idea is not new, but it bears repeating as often as possible because it’s far simpler than multithreaded systems with locking, 99% of the time it’s the right solution, and many people still don’t know it.
Pieter had two choice quotes that drove home the main goals of 0MQ:
- “If a developer can’t pick it up in a weekend, it’s not going to work.”
- “Cheap effective messaging changes the way we think about architecting applications.”
Hypertable – Doug Judd
I unfortunately missed the first few minutes of Doug’s talk. When I arrived, Doug (Hypertable) was in the midst of an architectural overview of the Google stack, BigTable architecture, the ideas behind a log-structured merge tree, and examples of Hypertable optimizations, including bloom filters and using different compression algorithms in different parts of the system.
The money slides came toward the end, with performance comparisions claiming Hypertable to be 70% faster than HBase on random reads and sequential writes. Another chart claimed Hypertable to be multiple times faster than HBase when doing only random reads of different distributions.
Apart from the previously mentioned optimizations, there seem to be two main reasons for Hypertable’s speed: It’s based in C vs. HBase’s Java, and it is smart enough to dynamically adjust memory between caching reads and buffering writes according to the read/write distribution of the data.
And yes, Hypertable works with Hadoop for Map/Reduce-ing goodness…
Apache Cassandra Revisited – Eric Evans
Eric quickly focusing the talk by narrowing from All-of-NoSQL to Just-the-Large-Data-Projects, and then finally Just-the-BigTable-or-Dynamo-projects, which means Cassandra, HBase, Hypertable, Riak, and Voldemort.
Given these criteria, Eric called Cassandra the love-child of BigTable AND Dynamo, having influences from both. As such it has Dynamo staples like homogonous nodes, P2P-routing and partitioning (though not VNodes), and things like SSTables and (optionally) ordered data and range queries, similar to BigTable. (His slides contained a humorous, yet distrurbing picture showing a Brad Pitt/Angelina Jolie mutant child.)
Eric described the bootstrap process, the Cassandra data model (Keyspace, Column Family, Record, Column), and the interface (Thrift), and showed API examples.
He then highlighted a few key Cassandra developments and features:
- Cassandra does now support batch Map/Reduce via Hadoop.
- Cassandra comes with rack awareness, and this can be customized.
- Keyspaces and Column families, currently defined in XML, will soon be configurable via an API without a restart.
- Vector clocks will be added in the future. (But in his view, the hype outweights the benefit.)
- SuperColumns may be phased out in the future, as they are not widely used, and lead to more confusion than they are worth.
According to Eric, the largest Cassandra instance that he knows of is Twitter, with around 100 nodes holding about 170TB of data.
Massively Parallel Analytics Beyond Map/Reduce – Fabian Huske
Fabian (TU Berlin) began by describing some of the challenges behind Map/Reduce, namely that it does make big data processing more simple than it used to be, but it still requires a developer to fit his problem into something Map/Reduce shaped, and this is exacerbated by the complexities of the various Map/Reduce frameworks out there.
Fabian then introduced Stratosphere, which is a combination programming model (PACT) and execution engine (Nephele) that provides additional blocks beyond Map and Reduce that can be used instead of a simple Map or Reduce, with the dual goals of making it easier to program as well as require fewer execution phases leading to higher performance. Stratosphere is a result of combining Map/Reduce with parallel database technology.
As an example, Fabian showed a SQL task that could be converted to two Map/Reduce jobs that with Stratosphere could be made simpler using Stratosphere.
A few examples: with PACT, you have new second-order functions in which to put your user code such as operations for “cross” (compute a cartesian cross-product of inputs), “match” (compute only where input keys from both sources match), and “cogroup” (missed this one). Building more complex second-order functions allows for less user code.
Next steps for the projcet are more input contracts, flexible checkpointing and recovery, and robust and adaptive execution, with a goal of going open-source by the end of 2010.
Sqoop – Database Import and Export for Hadoop – Aaron Kimball
Aaron (Cloudera) set the stage with a quick run-down of the limits of the SQL world, and the plusses and minuses of Hadoop, which lead to the introduction of Sqoop (SQL in Hadoop).
Sqoop provides a suite of tools to connect Hadoop to a JDBC-compliant SQL database, extract data and schema information, import the data into Hadoop, auto-generate code to parse the data, and export any results back into the SQL database.
The goal is to make it easier to pull SQL-hosted data into your Hadoop-cluster for the purpose of having the data available while doing other processing. For example, clickstream data might be in Hadoop, while profile information is in SQL. With Sqoop, you can get the data into Hadoop in an efficient way to support analysis. Copying the data from SQL in one operation is better than repeatedly hitting the database while running analysis because a big Hadoop cluster can easily hose a SQL machine.
Sqoop has some complexity under the hood:
- It can export data definitions to Hive (see writeup next.)
- It reads from/write to SQL in parellel.
- You can use a SELECT/WHERE query to get data, which allows you to run Sqoop incrementally, fetching new data since the last run.
- Supports mysqldump and mysqlexport.
Hive: SQL for Hadoop – Sarah Sproehnle
Sarah (Cloudera) described Hive, a parser, optimizer, compiler, and shell for transforming SQL-like queries into Map/Reduce. With Hive, you think of your data as being in tables rather than files, so you create tables, load data from a local file or Hadoop file into the table, and can then run SQL-like queries.
(I used the word “SQL-like” above, but Hive queries are actually standards compliant SQL, with just a few limitations/twists. Anyone who knows SQL at any level can pick up the changes in just a few minutes.)
In other words, with Hive you can:
- CREATE and DESCRIBE tables, and ADD and DROP columns on tables.
- EXPLAIN queries.
- Query Hadoop data using SELECT, TOP, FROM, JOIN, WHERE, GROUP BY, and ORDER BY.
- Write Hadoop data using INSERT (though the insert actually means “clobber the old data and replace it with this new data”)
- In a twist on standard SQL, run a multi-table INSERT, where you split a single SELECT/FROM into multiple output streams, allowing you to write different columns to different tables. You can also, further filter, group, or transform the data in each stream independently before writing to the final table.
- Run data through a custom shell script, expecting lines of data on stdin, results on stdout.
- Partition and bucket data, allowing for easy ways to drop a subset of data, or take a sampling of data.
Hive gives you the convenience of SQL, but at the end of the day it’s still running as a Map/Reduce job on Hadoop, which means:
- No transactions.
- Latencies measured in minutes, not milliseconds.
- No indexes, think of everything as a full-table scan.
Not surprising, and not bad considering you can run a SQL query across Petabytes of data.
The Hive install is installed on the client, so you don’t need to do anything to the Hadoop cluster to run it. Hive keeps schema information in a Metastore, which can be kept on the local machine without any special configuration, or shared in a central repository allowing multiple users to share Hive table definitions. The schema is verified at data read time, not when the schema is created. Again, this makes sense given Hadoop’s execution model.
1,000 points to Sarah for running a live demo during the presentation. Gutsy, but always a crowd pleaser.
Talks I Wished I Had Attended
The conference schedule today had two tracks, so there were a number of talks I was not able to attend. I would have liked to see the talks below, and look forward to the conference video:
- Hadoop: An Industry Perspective – Aaron Kimball
- Behemith: A Hadoop-based platform for large scale document processing – Julian Nioche
- Introduction to Collaborative Filtering using Mahout – Frank Scholten
Isabel Drost, Jan Lehnardt, and Simon Willnauer kept the wrap-up short, thanking the other organizers, the tech staff (who gave a quick, fun recap of network usage), the venue, the presenters, and the audience.
When Jan asked who wanted to go to BerlinBuzzwords 2011 next year, every hand in the room shot up.
BerlinBuzzwords was an amazing conference. Half of the credit goes to the organizers for picking a great venue and interesting presenters. The other half goes to the largely German/European audience, who, 99% of the time, were focused on the presentation with laptops closed and (often) paper notepads open. This level of engagement lead to great questions from the audience after each presentation, and lots of hallway interaction. Sign me up for next year!