Spark Connector

Write it like Riak. Analyze it like Spark.

To reveal valuable patterns, trends, and associations, Big Data applications need to process data in real time. With the Riak KV Spark Connector, you can move data from Riak KV to Apache Spark for enhanced in-memory analytics, and then store the results in Riak KV for future data processing.

Riak KV with the Spark Connector combines the real-time operational analytics of Spark with the availability and scalability of Riak KV.

OPERATIONAL ANALYTICS MUST BE FAST

Big Data means Big Analytics. You can boost the power of your analytics by adding Apache Spark to the availability and scalability of Riak KV. Apache Spark is an analytics framework, and Riak KV is built to store massive amounts of unstructured data. Together, they allow you to do real-time operational analytics.
The Spark Connector supports both batch and streaming analysis, so you can use a single framework for batch processing and for real-time analytics on operational data.

The Spark Connector allows you to expose data stored in Riak KV as Spark Resilient Distributed Datasets (RDDs) or DataFrames, as well as output data from Spark RDDs or DataFrames into Riak KV.

Spark Connector features:

Construct a Spark RDD from a Riak KV bucket with a set of keys
Construct a Spark RDD from a Riak KV bucket by using a 2i string index or a set of indexes
Construct a Spark RDD from a Riak KV bucket by using a 2i range query or a set of ranges
Map JSON formatted data from Riak KV to user-defined types
Save a Spark RDD into a Riak KV bucket and apply 2i indexes to the contents
Construct a Spark RDD using Riak KV bucket’s enhanced 2i query (a.k.a. full-bucket read)
Perform parallel full-bucket reads from a Riak KV bucket into multiple partitions
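
Reading by a set of ranges, or into multiple partitions, implies dividing one large index range into smaller pieces that can be read independently. The sketch below is a conceptual illustration only, written in plain Scala with no Spark or Riak dependency. The object RangeSplitSketch and the function splitRange are hypothetical names, not part of the Spark Connector, and the connector's own partitioning logic may differ.

```scala
// Hypothetical helper for illustration; NOT part of the Spark Connector API.
object RangeSplitSketch {

  /** Split the inclusive range [from, to] into at most `parts` contiguous,
    * non-overlapping, inclusive sub-ranges of near-equal size.
    * Assumes the range length fits in a Long.
    */
  def splitRange(from: Long, to: Long, parts: Int): Seq[(Long, Long)] = {
    require(parts > 0, "parts must be positive")
    require(from <= to, "from must not exceed to")

    val total = to - from + 1                      // number of index values covered
    val n     = math.min(parts.toLong, total).toInt // never create empty sub-ranges
    val base  = total / n                          // minimum size of each sub-range
    val extra = total % n                          // the first `extra` sub-ranges get one more value

    (0 until n).map { i =>
      val start = from + i * base + math.min(i.toLong, extra)
      val size  = base + (if (i < extra) 1L else 0L)
      (start, start + size - 1)
    }
  }

  def main(args: Array[String]): Unit = {
    // Divide the index range 1..5000 into four sub-ranges and print them.
    splitRange(1L, 5000L, 4).foreach { case (lo, hi) => println(s"$lo..$hi") }
  }
}
```

For the range 1 to 5000 and four parts, this yields 1..1250, 1251..2500, 2501..3750, and 3751..5000. Each pair could then serve as one of the ranges in a multi-range 2i read.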

Loading Data from Riak KV into Spark

The example below shows a full-bucket read using a single command.

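SCALA

import com.basho.riak.client.core.query.Namespace
import com.basho.riak.spark._

// sc is an existing SparkContext; read every value in the bucket as a String
val data = sc.riakBucket[String](new Namespace("bucket-full-of-data"))
    .queryAll()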

If you want specific results and know your keys by name, you can pass them in directly:

SCALA

// Fetch only the values stored under the listed keys
val rdd = sc.riakBucket[String](new Namespace("FOO"))
      .queryBucketKeys("mister X", "miss Y", "dog Z")

The example below reads a range of values (for example, 1 to 5000) defined by a numeric 2i index, where the bucket is named “BAR” and the index is “myIndex”:

SCALA

// Fetch values whose numeric 2i index "myIndex" falls between 1 and 5000
val rdd = sc.riakBucket[String](new Namespace("BAR"))
      .query2iRange("myIndex", 1L, 5000L)
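
The examples above only read from Riak KV. The following is a minimal sketch of a full round trip: read from one bucket, transform the data in Spark, and store the results in another bucket. It rests on several assumptions that should be checked against the connector version in use: that the connector and Spark are on the classpath, that a Riak KV node is reachable at 127.0.0.1:8087, that the connection is configured through the spark.riak.connection.host property, and that saveToRiak is available on RDDs through the com.basho.riak.spark._ import. The object name RiakRoundTripSketch and the bucket name "analysis-results" are hypothetical.

```scala
import com.basho.riak.client.core.query.Namespace
import com.basho.riak.spark._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical application object; requires a running Riak KV node.
object RiakRoundTripSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: Riak KV accepts connections on 127.0.0.1:8087.
    val conf = new SparkConf()
      .setAppName("riak-round-trip-sketch")
      .setIfMissing("spark.master", "local[2]") // local mode unless a master is supplied
      .set("spark.riak.connection.host", "127.0.0.1:8087")
    val sc = new SparkContext(conf)

    try {
      // Read every value in the source bucket as a String.
      val data = sc.riakBucket[String](new Namespace("bucket-full-of-data"))
        .queryAll()

      // Ordinary Spark transformations: trim, drop empty values, de-duplicate.
      val results = data.map(_.trim).filter(_.nonEmpty).distinct()

      // Assumption: saveToRiak writes each element into the target bucket.
      results.saveToRiak(new Namespace("analysis-results"))
    } finally {
      sc.stop() // release Spark resources even if the job fails
    }
  }
}
```

Stopping the SparkContext in a finally block keeps a failed read or write from leaving the application hanging.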

BENEFITS OF SPARK CONNECTOR IN RIAK KV

Big Data applications require fast analytics that scale as the data grows. Riak KV with the Spark Connector gives you high availability, scalability, and real-time analytics.

Make real-time decisions
Whether you make on-demand recommendations, or get automated alerts and analysis of events as they happen, advanced analytics is key to driving and guiding your business. Riak KV with the Spark Connector lets you integrate analytics into every business decision by providing fast, large-scale data analysis.

Increase performance and scale
As Big Data applications grow, you need a solution that not only analyzes data sets fast, but also scales easily on demand. Riak KV with the Spark Connector provides high performance analytics and near-linear scale using commodity hardware.

Faster time to market
Big Data applications require complex analytics. The Riak KV Spark Connector simplifies working with Riak KV and Spark. Developers get a broad set of APIs to write complex aggregations. This means you can do more complex processing with less effort, allowing you to complete your applications faster and get to market sooner.