Write it like Riak. Analyze it like Spark.

To reveal valuable patterns, trends, and associations, Big Data applications need to process data in real time. With the Riak KV Spark Connector, you can move data from Riak KV to Apache Spark for enhanced in-memory analytics, and then store the results in Riak KV for future data processing.

Riak KV with the Spark Connector combines the real-time operational analytics of Spark with the availability and scalability of Riak KV.
 

OPERATIONAL ANALYTICS MUST BE FAST

Big Data means Big Analytics. You can boost the power of your analytics by adding Apache Spark to the availability and scalability of Riak KV. Apache Spark is an analytics framework, and Riak KV is built to store massive amounts of unstructured data. Together, they allow you to do real-time operational analytics.
The Apache Spark Connector supports both batch and streaming analysis, meaning you can use a single framework for your batch processing as well as your real-time analytics on operational data.

The Spark Connector allows you to expose data stored in Riak KV as Spark Resilient Distributed Datasets (RDDs) or DataFrames, as well as output data from Spark RDDs or DataFrames into Riak KV.

Spark Connector features:

  • Construct a Spark RDD from a Riak KV bucket with a set of keys
  • Construct a Spark RDD from a Riak KV bucket by using a 2i string index or a set of indexes
  • Construct a Spark RDD from a Riak KV bucket by using a 2i range query or a set of ranges
  • Map JSON formatted data from Riak KV to user-defined types
  • Save a Spark RDD into a Riak KV bucket and apply 2i indexes to the contents
  • Construct a Spark RDD using a Riak KV bucket’s enhanced 2i query (a.k.a. full-bucket read)
  • Perform parallel full-bucket reads from a Riak KV bucket into multiple partitions
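The "set of ranges" and "multiple partitions" items above both rest on the same idea: one large index range is cut into smaller contiguous ranges that can be read independently. The sketch below shows that idea in plain Scala. It is an illustration only, not the connector's own partitioning logic, and the helper name splitRange is hypothetical.

```scala
// Illustration only: split an inclusive numeric range into contiguous
// sub-ranges, e.g. one sub-range per Spark partition.
object RangeSplitSketch {

  /** Splits [from, to] into at most `parts` non-overlapping inclusive ranges. */
  def splitRange(from: Long, to: Long, parts: Int): Seq[(Long, Long)] = {
    require(parts > 0, "parts must be positive")
    require(from <= to, "from must not exceed to")

    val total = to - from + 1                    // number of index values covered
    val n     = math.min(parts.toLong, total).toInt
    val base  = total / n                        // minimum size of each sub-range
    val extra = total % n                        // the first `extra` sub-ranges get one more value

    // Start of sub-range i; bound(n) is one past the end of the whole range.
    def bound(i: Int): Long = from + i * base + math.min(i.toLong, extra)

    (0 until n).map(i => (bound(i), bound(i + 1) - 1))
  }
}
```

Each resulting pair could then serve as the lower and upper bound of a separate 2i range query, so that the sub-ranges are read in parallel.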

Loading Data from Riak KV into Spark

The example below shows a full-bucket read using a single command.

SCALA

val data = sc.riakBucket[String](new Namespace("bucket-full-of-data"))
  .queryAll()

 

If you want specific results and know your keys by name, you can pass them in directly:

SCALA

val rdd = sc.riakBucket(new Namespace("FOO"))
  .queryBucketKeys("mister X", "miss Y", "dog Z")

 

The example below reads a range of values (e.g. 1 – 5000) defined by a numeric 2i index, where the bucket is named “BAR” and the index is “myIndex”:

SCALA

val rdd = sc.riakBucket(new Namespace("BAR"))
  .query2iRange("myIndex", 1L, 5000L)
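The snippets in this section only read from Riak KV. A full round trip, where data is loaded, analyzed, and stored back in Riak KV, might look like the sketch below. It makes several assumptions: a single Riak KV node on 127.0.0.1:8087, a local Spark master, the connector JAR on the classpath, and a hypothetical results bucket named "analysis-results". The imports, the spark.riak.connection.host setting, and the saveToRiak call follow the connector's published usage as best recalled, so check them against the connector version you have installed.

```scala
import com.basho.riak.client.core.query.Namespace
import com.basho.riak.spark._ // assumed to add riakBucket and saveToRiak via implicits
import org.apache.spark.{SparkConf, SparkContext}

object RoundTripSketch {
  def main(args: Array[String]): Unit = {
    // Assumed local setup: one Riak KV node reachable over protocol buffers.
    val conf = new SparkConf()
      .setAppName("riak-round-trip-sketch")
      .setMaster("local[2]")
      .set("spark.riak.connection.host", "127.0.0.1:8087")
    val sc = new SparkContext(conf)

    val source  = new Namespace("bucket-full-of-data")
    val results = new Namespace("analysis-results") // hypothetical bucket name

    // Load every value in the source bucket as a String.
    val data = sc.riakBucket[String](source).queryAll()

    // Placeholder analysis: the length of each stored value.
    val lengths = data.map(_.length)

    // Store the derived values back in Riak KV.
    lengths.saveToRiak(results)

    sc.stop()
  }
}
```

The placeholder analysis is deliberately trivial. Any RDD transformation could take its place, as long as the element type of the final RDD is one the connector can serialize.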

 

BENEFITS OF SPARK CONNECTOR IN RIAK KV

Big Data applications require fast analytics that scale as the data grows. Riak KV with the Spark Connector gives you high availability, scalability, and real-time analytics.

Make real-time decisions
Whether you make on-demand recommendations, or get automated alerts and analysis of events as they happen, advanced analytics is key to driving and guiding your business. Riak KV with the Spark Connector lets you integrate analytics into every business decision by providing fast, large-scale data analysis.

Increase performance and scale
As Big Data applications grow, you need a solution that not only analyzes data sets quickly, but also scales easily on demand. Riak KV with the Spark Connector provides high-performance analytics and near-linear scale using commodity hardware.

Faster time to market
Big Data applications require complex analytics. The Riak KV Spark Connector simplifies working with Riak KV and Spark. Developers get a broad set of APIs to write complex aggregations. This means you can do more complex processing with less effort, allowing you to complete your applications faster, and to get to market sooner.

“Alert Logic depends on the reliable processing of massive amounts of machine data and turning that into actionable information. Our security operations center depends on this information for analysis to detect and respond to real-time security incidents that occur on our customers’ networks. We selected Riak KV for scalability and fault-tolerance, and it continues to be a vital component helping ensure that the Alert Logic Platform can scale to keep up with our rapid growth.”

– Paul Fisher, Director of Platform Services, Alert Logic

  1. RESILIENCY
  2. MASSIVE SCALABILITY
  3. OPERATIONAL SIMPLICITY
  4. INTELLIGENT REPLICATION
  5. COMPLEX QUERY SUPPORT
  6. GLOBAL OBJECT EXPIRATION
  7. DOTTED VERSION VECTORS (DVVs)
  8. RIAK DATA TYPES
  9. ROBUST APIs & CLIENT LIBRARIES
  10. APACHE SPARK CONNECTOR
  11. APACHE MESOS FRAMEWORK
  12. REDIS CACHE INTEGRATION
  13. MULTI-CLUSTER REPLICATION