Tag Archives: Hadoop

Talend and Basho Partner for Streamlined Data Migration for Customers

March 19, 2014

When implementing Riak, our customers often need to migrate their data from their existing architecture. Depending on the setup, this can cause some pain points during the transition process. Basho has partnered with Talend to make it faster and more cost-effective for customers to migrate their data from existing infrastructure to Riak.

Distributed NoSQL databases like Riak are perfect for big data projects, which require storing large volumes of data, the ability to scale predictably, and flexible storage for a wide variety of constantly changing data. Legacy relational systems can’t keep up with these needs. Through this partnership, customers looking to move from a relational system to Riak can take advantage of Talend’s native big data integration solutions to quickly transition their data to Riak.

In addition to migrating data into Riak, users can also easily move Riak data into Hadoop for big data analytics. Riak and Hadoop are fundamentally different solutions that address different challenges (for more information, check out How is Riak Different from Hadoop). Talend makes it easier for them to work together for storage and analytics.

For more information about our Talend partnership, check out the full release.

For a complete list of partners, or to become a partner, visit our Partnerships Page.

Basho

The Weather Company Deploys Basho Riak for Next Generation Forecast and Data Services Platform

The Weather Company VP of Enterprise Data to Unveil New IT Platform at Amazon AWS re:Invent 2013

LAS VEGAS, NV – November 13, 2013 – Amazon AWS re:Invent 2013 – During the past year, The Weather Company has undergone a remarkable transformation to its technology infrastructure as a data-driven media and technology company. As a part of the transformation, The Weather Company, which oversees popular brands such as The Weather Channel, weather.com, Weather Underground, Intellicast and WSI, has selected next generation technologies to underpin its new IT platform, including Basho’s Riak, Apache Hadoop, and Dasein.

“Weather is the original big data. However, weather changes, and so must IT,” commented Bryson Koehler, executive vice president and CIO, The Weather Company. “A massive data explosion is at the center of our growth strategy. The Weather Company requires an architecture that is both flexible and reliable, allowing us to deliver higher accuracy through superior data. Basho has been a valuable partner in our transformation and Riak has proven to be a critical component as the NoSQL distributed database powering our new platform.”

Since joining The Weather Company in July 2012 as EVP and CIO, Bryson Koehler has renewed the company’s vision for technology and executed a global IT transformation to propel their business growth strategy.

The Weather Company deployed Riak across multiple Amazon Web Services (AWS) availability zones using Riak Enterprise’s Multi-Datacenter Replication for ultra high-availability. Riak is used to store a variety of data from satellites, radars, forecast models, users, and weather stations worldwide. The next-generation data services platform enables The Weather Company to expand its ability to serve superior professional weather services in major weather-influenced markets, including media, hospitality, insurance, energy, and retail.

“The Weather Company typifies the customer that Riak was designed for,” said Greg Collins, CEO of Basho Technologies. “Aligning data services with customer consumption requires a flexible, available, and operationally simple database architecture. Riak Enterprise solves these needs while also working with The Weather Company to ensure it is able to scale to meet the demand of an increasing set of data points and be well positioned for future initiatives.”

During Amazon re:Invent 2013, Sathish Gaddipati, vice president of enterprise data at The Weather Company, will unveil the company’s newest Weather Forecast and Data Services Platform and TWC’s overall enterprise IT transformation. Sathish’s presentation, titled “How the Weather Company Monetizes Weather, the Original Big Data Problem,” will be Thursday, November 14 at 5:30pm at The Venetian Las Vegas.

"How is Riak different from Hadoop?"

October 28, 2013

The technology community is extremely agile and fast-paced. It can turn on a dime to solve business problems as they arise. However, with this agility comes a stream of new terminology that often creates false categorizations. This can lead to confusion, especially when companies evaluate new technologies based on a surface understanding of these terms. The world of data is full of such terms, including the notorious “NoSQL” and “big data.”

As described in a previous post, NoSQL is a misleading term. This term represents a response to changing business priorities that require more flexible, resilient architectures (as opposed to the traditional, rigid systems that often happen to use SQL). However, within the NoSQL space, there are dozens of players that can be as different from one another as they are from any of the various SQL-speaking systems.

Big data is another term that, while fairly self-explanatory, has been overused to the point of dilution. One reason why NoSQL databases have become necessary is because of their ability to easily scale to keep up with data growth. Simply storing a lot of data isn’t the solution though. Some data is more critical than others (and should be accessible no matter what) and some data needs to be analyzed to provide business insights. When digging into a business, big data is too vague a term to describe both of these use cases.

As terms like these are used loosely, industry confusion follows. One area of confusion that we have experienced relates to Basho’s own distributed database, Riak, and the distributed processing system, Hadoop.

While these two systems are actually complementary, we are often asked “How is Riak different from Hadoop?”

To help explain this, it’s important to start with a basic understanding of both systems. Riak is a distributed database built for high availability, fault tolerance, and scalability. It is best used to store large amounts of critical data that applications and users need constant access to. Riak is built by Basho Technologies and can be used as an alternative to, or in conjunction with, relational databases (such as MySQL) and other “NoSQL” databases (such as MongoDB or Cassandra).

Hadoop is a framework that allows for the distributed parallel processing of large data sets across clusters of computers. It was originally based on the “MapReduce” system, which was invented by Google. Hadoop consists of two core parts: the underlying Hadoop Distributed File System (HDFS), which ensures stored data is always available to be analyzed, and MapReduce, which allows for scalable computation by dividing and running queries over multiple machines. Hadoop provides an inexpensive, scalable solution for bulk data processing and is mostly used as part of an overarching analytics strategy, not for primary “hot” data storage.
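To make the two phases concrete, here is a minimal sketch of the classic word count job written against Hadoop’s standard Mapper and Reducer APIs. It is a generic Hadoop illustration (the class names are our own), not Riak code: map tasks each process their own slice of the input in parallel, and the framework groups their output by word before the reduce tasks sum the counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each task reads its own split of the input and emits (word, 1) pairs.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: the framework groups the pairs by word; each reducer sums the counts.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        context.write(word, new IntWritable(total));
    }
}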

One easy way to distinguish between the two is to look at some of the common use cases.

Riak Use Cases

Riak can be used by any application that needs to always have access to large amounts of critical data. Riak uses a key/value data model and is data-type agnostic, so operators can store any type of content in Riak. Due to the key/value model, certain industry use cases fit easily into Riak. These include:

  • Gaming – storing player data, session data, etc
  • Retail – underpinning shopping carts, product inventories, etc
  • Mobile – social authentication, text and multimedia storage, global data locality, etc
  • Advertising – serving ad content, session storage, mobile experiences, etc
  • Healthcare – prescription or patient records, patient IDs, health data that must always be available across a network of providers, etc

For a full list of use cases, check out our Users Page.
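To make the key/value model concrete, here is a minimal sketch of storing and fetching a shopping cart by session ID, the access pattern behind the retail and gaming examples above. It uses the 1.x-era Riak Java Client API as we recall it; treat the exact method names as assumptions to check against the client documentation.

import com.basho.riak.client.IRiakClient;
import com.basho.riak.client.IRiakObject;
import com.basho.riak.client.RiakFactory;
import com.basho.riak.client.bucket.Bucket;

public class CartExample {
    public static void main(String[] args) throws Exception {
        // Connect to a local Riak node over Protocol Buffers (default port 8087).
        IRiakClient client = RiakFactory.pbcClient();

        // A bucket is just a namespace for keys; values can be any content type.
        Bucket carts = client.fetchBucket("carts").execute();

        // Store an opaque value under a key...
        carts.store("session-42", "{\"items\":[\"sku-1\",\"sku-2\"]}").execute();

        // ...and read it back by the same key.
        IRiakObject cart = carts.fetch("session-42").execute();
        System.out.println(cart.getValueAsString());

        client.shutdown();
    }
}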

Hadoop Use Cases

Hadoop is designed for situations where you need to store unmodeled data and run computationally intensive analytics over that data. The original use cases of MapReduce and Hadoop were to produce indexes for distributed search engines at Google and Yahoo, respectively. Any industry that needs to run large-scale analytics to improve its business can use Hadoop. Common examples include finance (building models for accurate portfolio evaluation and risk analysis) and eCommerce (analyzing shopping behavior to deliver product recommendations or better search results).

Riak and Hadoop are based on many of the same tenets, making their usage complementary for some companies. Many companies that utilize Riak today have created scripts or processes to pull data from Riak and push it into other solutions (like Hadoop) for historical archiving or future analysis. Recognizing this trend, Basho is exploring additional tools to simplify this process.

If you are interested in our thinking on these data export capabilities, please contact us.

In Summary

Every tool has its value. Hadoop excels when a relatively small subset of the business needs to answer big analytical questions. Riak excels at powering critical data for a very large number of users and applications.

Basho

"ZooKeeper for the Skeptical Architect" from RICON East

June 26, 2013

Camille Fournier is the VP of Technical Architecture at Rent the Runway and is an expert in distributed systems and ZooKeeper. She was also one of the speakers at RICON East, Basho’s distributed systems conference. Her talk was entitled, “ZooKeeper for the Skeptical Architect.”

ZooKeeper has become quite ubiquitous: it is a core component of the Hadoop ecosystem and enables high availability for systems like Redis and Solr. However, as Camille points out, just because something is popular doesn’t mean you should use it. To help you decide whether ZooKeeper is right for you, she goes over the core uses of ZooKeeper in the wild and why it is suited to those use cases. She also talks about systems that don’t use ZooKeeper and why that can be the right decision. Finally, she discusses the common challenges of running ZooKeeper as a service and things to look out for when architecting a deployment. Her full talk is below:

[youtube http://www.youtube.com/watch?v=j4uwKP7WJFk&w=640&h=390]

You can also check out her slide deck here.

If you’re interested in speaking at RICON West (Oct. 29-30th in San Francisco), we are now accepting proposals through July 1st. If you’re interested in attending, you can purchase early bird tickets here.

Basho

Riak and Hadoop Word Count Example

December 1, 2011

My last post was a brief description of the integration between Riak and Hadoop MapReduce. This follow-up post is a bit more hands-on and walks you through an example riak-hadoop MapReduce job. If you want to have a go, you need to follow these steps. I won’t cover all the details of installing everything, but I’ll point you at resources to help.

Getting Set Up

First you need both a Riak and a Hadoop install. I went with a local devrel Riak cluster and a local Hadoop install in pseudo-distributed mode. (A single node each of Hadoop and Riak installed from packages will do, if you’re pushed for time.)

The example project makes use of Riak’s Secondary Indexes (“2i”), so you will need to switch Riak’s backend to LevelDB. (NOTE: this is not a requirement for riak-hadoop, but this demo uses 2i, which does require it.) So, for each Riak node, change the storage backend from Bitcask to eLevelDB by editing app.config:

{riak_kv, [
    %% Storage_backend specifies the Erlang module defining the storage
    %% mechanism that will be used on this node.
    {storage_backend, riak_kv_eleveldb_backend},
    %% ...and the rest
]},

Now start up your Riak cluster, and start up Hadoop:

for d in dev/dev*; do $d/bin/riak start; done
cd $HADOOP_HOME
bin/start-dfs.sh
bin/start-mapred.sh

Next you need to pull the riak-hadoop-wordcount sample project from GitHub. Check out the project and build it:

git clone https://github.com/russelldb/riak-hadoop-wordcount
cd riak-hadoop-wordcount
mvn clean install

A Bit About Dependencies…

Warning: Hadoop has some class loading issues; there is no class namespace isolation. The Riak Java Client depends on Jackson for JSON handling, and so does Hadoop, but on different versions (naturally). When the Riak-Hadoop driver is finished it will come with a custom classloader. Until then, however, you’ll need to replace your Hadoop lib/jackson*.jar libraries with the ones in the lib folder of this repo on your JobTracker / NameNode only. On your task nodes, you need only remove the Jackson jars from your hadoop/lib directory, so that the Jackson classes bundled in the job jar are the ones that get loaded. (There is an open bug about this in Hadoop’s JIRA. It has been open for 18 months, so I doubt it will be fixed anytime soon. I’m very sorry about this. I will address it in the next version of the driver.)

Loading Data

The repo includes a copy of Mark Twain’s Adventures of Huckleberry Finn from Project Gutenberg, and a class that chunks the book into chapters and loads the chapters into Riak.

To load the data, run the Bootstrap class. The easiest way is to have Maven do it for you:

mvn exec:java -Dexec.mainClass="com.basho.riak.hadoop.Bootstrap" \
    -Dexec.classpathScope=runtime

This will load the data (with an index on the Author field). Have a look at the Chapter class if you’d like to see how easy it is to store a model instance in Riak.
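If you don’t want to open the source just yet, Chapter is essentially a plain Java object with a couple of the Riak Java Client’s mapping annotations on it. A cut-down sketch of the idea (field names here are illustrative, not a verbatim copy of the repo):

import com.basho.riak.client.convert.RiakIndex;
import com.basho.riak.client.convert.RiakKey;

// A plain Java object; the Riak Java Client serializes it to JSON via Jackson.
public class Chapter {
    @RiakKey                        // used as the Riak key for the stored object
    private String id;

    @RiakIndex(name = "author")     // becomes the secondary index author_bin
    private String author;

    private String title;
    private String text;            // the chapter body

    // Getters and setters omitted for brevity.
}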

Bootstrap assumes that you are running a local devrel cluster. If your Riak install isn’t listening on the PB interface on 127.0.0.1 on port 8081 then you can specify the transport and address like this:

mvn exec:java -Dexec.mainClass="com.basho.riak.hadoop.Bootstrap" \
    -Dexec.classpathScope=runtime -Dexec.args="[pb|http PB_HOST:PB_PORT|HTTP_URL]"

That should load one item per chapter into a bucket called wordcount. You can check it succeeded by running (being sure to adjust the url based on your configuration):

curl http://127.0.0.1:8091/riak/wordcount?keys=stream

Package The Job Jar

If you’re running the devrel cluster locally you can just package up the jar now with:

mvn clean package

Otherwise, first edit the RiakLocations in the RiakWordCount class to point at your Riak cluster/node, e.g.

conf = RiakConfig.addLocation(conf, new RiakPBLocation("127.0.0.1", 8081));
conf = RiakConfig.addLocation(conf, new RiakPBLocation("127.0.0.1", 8082));

Then simply package the jar as before.

Run The Job

Now we’re finally ready to run the MapReduce job. Copy the jar from the previous step to your hadoop install directory and kick off the job.

cp target/riak-hadoop-wordcount-1.0-SNAPSHOT-job.jar $HADOOP_HOME
cd $HADOOP_HOME
bin/hadoop jar riak-hadoop-wordcount-1.0-SNAPSHOT-job.jar

And wait… Hadoop helpfully provides status messages as it distributes the code and orchestrates the MapReduce execution.

Inspect The Results

If all went well there will be a bucket in your Riak cluster named wordcount_out. You can confirm it is populated by listing its keys:

curl http://127.0.0.1:8091/riak/wordcount_out?keys=stream

Since the WordCountResult output class has RiakIndex annotations for both the count and word fields, you can perform 2i queries on your data. For example, this should give you an idea of the most common words in Huckleberry Finn:

curl 127.0.0.1:8091/buckets/wordcount_out/index/count_int/1000/3000

Or, if you wanted to know which "f" words Twain was partial to, run the following:

curl 127.0.0.1:8091/buckets/wordcount_out/index/word_bin/f/g
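The _int and _bin suffixes in those URLs come from Riak’s secondary index naming convention for integer and binary (string) indexes; the Java client applies the suffix based on the annotated field’s type. Roughly, and again as a sketch rather than a verbatim copy of the repo, the output class looks like this:

import com.basho.riak.client.convert.RiakIndex;
import com.basho.riak.client.convert.RiakKey;

public class WordCountResult {
    @RiakKey
    private String key;             // one object per word in wordcount_out

    @RiakIndex(name = "word")       // a String field becomes the binary index word_bin
    private String word;

    @RiakIndex(name = "count")      // an int field becomes the integer index count_int
    private int count;

    // Getters and setters omitted for brevity.
}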

Summary

We just performed a full roundtrip MapReduce, starting with data stored in Riak, feeding it to Hadoop for the actual MapReduce processing, and then storing the results back into Riak. It was a trivial task with a small amount of data, but it illustrates the principle and the potential. Have a look at the RiakWordCount class. You can see that only a few lines of configuration and code are needed to perform a Hadoop MapReduce job with Riak data. Hopefully the riak-hadoop-wordcount repo can act as a template for some further exploration. If you have any trouble running this example, please let me know by raising a GitHub issue against the project or by jumping on the Riak Mailing List to ask some questions.
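If you just want the gist without opening the repo, the wiring in RiakWordCount boils down to something like the sketch below. The location calls are the ones shown earlier; the commented-out key lister and output bucket helpers are assumptions about the riak-hadoop 0.2 API, so check the class itself for the exact names.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

import com.basho.riak.hadoop.*; // riak-hadoop driver classes; exact package layout assumed

public class RiakWordCountSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Tell the driver where the Riak cluster lives (same calls as shown above).
        conf = RiakConfig.addLocation(conf, new RiakPBLocation("127.0.0.1", 8081));
        conf = RiakConfig.addLocation(conf, new RiakPBLocation("127.0.0.1", 8082));

        // Assumed helpers: choose the keys to map over and the bucket to write results to.
        // conf = RiakConfig.setKeyLister(conf, new BucketKeyLister("wordcount"));
        // conf = RiakConfig.setOutputBucket(conf, "wordcount_out");

        Job job = new Job(conf, "riak-wordcount");
        job.setJarByClass(RiakWordCountSketch.class);

        // Riak stands in for HDFS on both the input and output side.
        job.setInputFormatClass(RiakInputFormat.class);
        job.setOutputFormatClass(RiakOutputFormat.class);

        // The repo's mapper extends RiakMapper; its reducer is a standard Hadoop Reducer
        // that emits WordCountResult values, which are written back as Riak objects.
        // job.setMapperClass(WordCountMapper.class);
        // job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(WordCountResult.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}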

Enjoy.

Russell

Riak and Hadoop (Sitting in a tree)

November 29, 2011

It has been pointed out on occasion that Riak MapReduce isn’t real “MapReduce”, often with reference to Hadoop, which is. There are many times when Riak’s data processing pipeline is exactly what you want, but in case it isn’t, and you want to leverage existing Hadoop expertise and investment, you may now use Riak as an input and output for Hadoop MapReduce, in place of HDFS.

This started off as a tinkering project, and it is currently released as riak-hadoop-0.2. I wouldn’t recommend it for production use today, but it is ready for exploratory work, whilst we work on some more serious integration for the future.

Input

Hadoop M/R usually gets its input from HDFS and writes its results back to HDFS. Riak-hadoop is a library that extends Hadoop’s InputFormat and OutputFormat classes so that a Riak cluster can stand in for HDFS in a Hadoop M/R job. The way this works is pretty simple. When defining your Hadoop job, you declare the InputFormat to be of type RiakInputFormat. You configure your cluster members and locations using the JobConf and a helper class called RiakConfig. Your Mapper class must also extend RiakMapper, since there are some requirements for handling eventual consistency that you must satisfy. Apart from that, you code your Map method as if for a typical Hadoop M/R job.

Keys, for the splits

When Hadoop creates a Mapper task it assigns an InputSplit to that task. An input split is the subset of data that the Mapper will process; in Riak’s case this is a set of keys. But how do we get the keys to Map over? When you configure your job, you specify a KeyLister. You can use any input to Hadoop M/R that you would use for Riak M/R: provide a list of bucket/key pairs, a 2i query, a Riak Search query, or, ill-advisedly, a whole bucket. The KeyLister will fetch the keys for the job and partition them into splits for the Mapper tasks. The Mapper tasks then access the data for the keys using a RiakRecordReader. The record reader is a thin wrapper around a Riak client; it fetches the data for the current key when the Hadoop framework asks.

Output

In order to output reduce results to Riak, your Reducer need only implement the standard Reducer interface. When you configure the job, just specify that you wish to use the RiakOutputFormat and declare an output bucket as the target for results. The keys/values from your reduce will then be written to Riak as regular Riak objects. You can even specify secondary indexes, Riak metadata, and Riak links on your output values, thanks to the Riak Java Client’s annotations and object mapping (courtesy of Jackson’s object mapper).

Hybrid

Of course you don’t need to use Riak for both input and output. You could read from HDFS, process and store results in Riak, or read from Riak and store results in HDFS.

Why do this?

This is really a proof-of-concept integration. It should be of immediate use to anyone who already has Hadoop knowledge and a Hadoop cluster. If you’re a Riak user with no Hadoop requirements right now, I’d say don’t rush into it: setting up a Hadoop cluster is far more complex than running Riak, and maintaining it is operationally taxing. If, however, you already have Hadoop, adding Riak as a data source and sink is incredibly easy, and gives you a great, scalable, live database for serving reads and taking writes, and you can leverage your existing Hadoop investment to aggregate that data.

What next?

The thinking reader might be saying “Huh? You stream the data in and out over the network, piecemeal?”. Yes, we do. Ideally we’d do a bulk, incremental replication between Riak and Hadoop (and back) and that is the plan for the next phase of work.

Summary

Riak-Hadoop enables Hadoop users to use a Riak cluster as a source and sink for Hadoop M/R jobs. This exposes the entire Hadoop toolset to Riak data (including query languages like Hive and Pig!). This is only a first pass at the integration problem, and though it is usable today, smarter sync is coming.

Please clone, build, and play with this project. Have at it. There’s a follow-up post coming soon with a look at an example word count Hadoop Map/Reduce job. If you can’t wait, just add a dependency on riak-hadoop, version 0.2, to your pom.xml and get started. Let me know how you get on, via the Riak mailing list.

Russell