Amazon Outage Proves Value of Riak’s Vision

April 21, 2011

The wide-reaching and well-documented Amazon outage that took down sites like Hootsuite, Reddit, and Foursquare today was a pain in the ass for a lot of people. I mean, how annoying is it to have to actually use the main Twitter web page? (I’m on a new laptop, so I didn’t feel like installing Tweetdeck again.)

But while I was slightly annoyed by the outage, imagine if you were Reddit or Foursquare! By leveraging Amazon to scale quickly without tons of data center overhead, these guys made a gamble that has, in a sense, already proven too risky. Yes, you can scale quickly and without a ton of cost. But what happens to your customer experience, and your data, when outages like this happen? And Reddit and Foursquare are just fun social apps. What about the web applications and architectures on Amazon that were trying to manage financial transactions? Could you imagine closing your store for a day? Ugh…

If there is any silver lining to all of this, it’s that it proves the value of having an intelligently distributed, eventually consistent, and highly available data and application network. Apart from the enhanced throughput, faster querying, etc. that come with non-relational approaches to data, it is important to realize that a distributed data platform like RiakEDS can make issues like this almost obsolete. (So, make sure the NoSQL database you’re getting psyched about lately can actually manage being deployed across different cloud environments, multiple nodes, clusters, what have you…)

If a site were using RiakEDS across Amazon, Joyent, VMware’s Private Cloud, their own data center, etc., they would be virtually impervious to failure. (I am sure there are quite a few conditions that would make this untrue.) The idea, as Basho account manager Matt "Roder" so frequently says, is to “be able to shoot a data center in the head and not lose one byte of data.”

Spinning up Oracle 11g or MySQL on Amazon will not allow this to be the case, period.
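
To make the "shoot a data center in the head" idea a bit more concrete, here is a toy sketch in Python. It is not Riak's API or its replication protocol. The `Site` and `ReplicatedStore` classes, the site names, and the `w` and `r` parameters are all made up for illustration. The only assumption is the one described above: every write is copied to several independent sites, so losing one of them does not lose the data.

```python
class Site:
    """A hypothetical data center (or cloud) holding one replica of the data."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class ReplicatedStore:
    """Toy key/value store that copies every write to all reachable sites.

    w: minimum number of sites that must accept a write.
    r: minimum number of sites that must answer a read.
    Both are illustrative knobs, not settings of any real product.
    """

    def __init__(self, sites, w=2, r=1):
        self.sites = sites
        self.w = w
        self.r = r

    def put(self, key, value):
        live = [site for site in self.sites if site.up]
        # Refuse the write up front if too few sites are reachable.
        if len(live) < self.w:
            raise RuntimeError("not enough live sites to accept the write")
        for site in live:
            site.data[key] = value
        return len(live)

    def get(self, key):
        replies = [
            site.data[key]
            for site in self.sites
            if site.up and key in site.data
        ]
        if len(replies) < self.r:
            raise KeyError(key)
        return replies[0]


if __name__ == "__main__":
    sites = [Site("amazon"), Site("joyent"), Site("own-datacenter")]
    store = ReplicatedStore(sites)
    store.put("order:1", "paid")

    sites[0].up = False            # one whole site goes dark
    print(store.get("order:1"))    # still served by the surviving sites
```

A real system also has to deal with what this sketch ignores: a site that comes back with stale data, conflicting writes made during a partition, and the latency of copying data between providers. Those are some of the "conditions" that can make the impervious-to-failure claim untrue.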

A nice explanation of the downtime is here (below), which shows the difficulty of building high availability (around the 1:30 mark):

Martin
