February 1st, 2015
If you missed last week’s webinar Preparing for the Deluge of Unstructured Data you can still watch it on-demand. Dorothy Pults and I discuss the news emanating from the 2015 Consumer Electronics Show and highlight that the Internet of Things, connected devices, and the resulting explosion of unstructured data are front and center of growth trends in 2015. In particular, we covered the topics of:
- What is driving the growth in unstructured data
- The challenges associated with managing unstructured data
- How companies are capitalizing on the opportunities that unstructured data presents to save money and time, and to create new market opportunities
The webinar covers each of these topics in great detail and provides some insights on distributed systems.
Why Distributed Systems?
Companies like Facebook, Amazon, and Google have built huge distributed systems with strict requirements around scalability, fault tolerance, and global footprints. These same concepts must now be considered by companies of all sizes…from the Enterprise to the startup.
The reality is that everything works at small scale. Challenges arise as it becomes necessary to scale out, up and down, predictably and linearly. When assuming that failure and latency are part of the equation, it is necessary to choose a distributed database that enables horizontal scale. And, similarly, that it enables this scale on commodity hardware or the compute instance that your business has adopted in its architecture. This is particularly important when data governance is a key component of your design considerations.
Ultimately, the customer experience matters. When designing your distributed architecture, and choosing persistence solutions like Riak, ensure that there is a solution for the geographic distribution of data (like Riak Enterprise’s multi-datacenter replication capability) to provide low latency experiences for your customers, regardless of their physical location.
For more information on this topic space, we have compiled a few resources to enable your education and decision-making.
January 22, 2015
In speaking with Riak users, both open source and commercial, we are frequently told that Riak’s key/value model is more flexible and faster to develop against than a traditional relational database. Even though Riak is well suited for many applications, there are inevitable tradeoffs in terms of query options and data types that are available. With a key/value model, there is no concept of columns or rows, therefore Riak does not have join operations. Riak can be queried directly via HTTP or the Protocol Buffers API, or through various client libraries. However, no SQL or SQL-like language is currently available.
Riak’s key/value data model does not preclude queryability. There are several powerful querying options including:
- Riak Search: Integration with Apache Solr provides full-text search and support for Solr’s client query APIs.
- Secondary Indexes: Secondary Indexes (2i) give developers the ability to tag an object stored in Riak with one or more query values. Indexes can be either integers or strings, and can be queried by either exact matches or ranges of values.
- MapReduce: Developers can leverage Riak MapReduce for tasks like filtering documents by tag, counting words in documents, and extracting links to related data.
For more information, check out the Riak documentation on Querying Data.
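The Secondary Indexes option above can be sketched with a small in-memory model. This is not the Riak client API; the objects, index names (the `_bin`/`_int` suffixes mirror Riak's string/integer index convention), and query helpers below are all illustrative assumptions:

```python
# Minimal in-memory sketch of the secondary-index (2i) pattern: each
# object is tagged with index entries, and queries match exact values
# or integer ranges. Index names like "age_int" are illustrative.

objects = {
    "user1": {"data": b"...", "indexes": {"group_bin": "admins", "age_int": 35}},
    "user2": {"data": b"...", "indexes": {"group_bin": "users",  "age_int": 28}},
    "user3": {"data": b"...", "indexes": {"group_bin": "admins", "age_int": 41}},
}

def query_exact(index, value):
    """Return keys whose index entry matches the value exactly."""
    return sorted(k for k, o in objects.items() if o["indexes"].get(index) == value)

def query_range(index, lo, hi):
    """Return keys whose integer index falls in [lo, hi]."""
    return sorted(k for k, o in objects.items()
                  if index in o["indexes"] and lo <= o["indexes"][index] <= hi)

print(query_exact("group_bin", "admins"))  # ['user1', 'user3']
print(query_range("age_int", 30, 45))      # ['user1', 'user3']
```

The key point is that the index values are query metadata attached to otherwise opaque objects; Riak itself maintains and scans the indexes so the application does not iterate over objects as this sketch does.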
The table below illustrates key/value mappings for common application types. Remember that values in Riak are opaque and stored on disk as binaries – JSON or XML documents, images, text, etc. The way data is organized in Riak should take into account the unique needs of the application, including access patterns such as read/write distribution, latency differences between various operations, use of Riak features (including MapReduce, Search, Secondary Indexes), and more.
| Application Type | Key | Value |
| --- | --- | --- |
| Session | User/Session ID | Session Data |
| Advertising | Campaign ID | Ad Content |
| Sensor | Date, Date/Time | Sensor Updates |
| User Data | Login, Email, UUID | User Attributes |
| Content | Title, Integer | Text, JSON/XML/HTML Document, Images, etc. |
Consider, for example, one of the canonical use cases for Riak…storing user and session data. In a relational database, the “users” table is well known and, basically, provides a unique identifier per user, and then a series of identifying information about that user as individual columns such as:
- First name
- Last name
- Counter of Site Visits
- Paid Account Identifier
This data can then be used to correlate or count paid users, common interests, etc., via a series of SQL queries against the row/column structure of the users table.
Riak, in contrast, provides flexibility in how this data can be modeled based upon the application use case. It may be desirable to create a Users bucket, with the UserName (or Unique Identifier) as the key and a JSON object storing all user attributes as the value. Or, as we describe in Data Modeling with Riak Data Types, leverage the power of Riak Data Types by creating a map type for each user storing:
- first and last name strings in the register type,
- interests as a set,
- a counter for visits,
- and a flag for paid account identifier.
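That user map can be sketched with plain Python structures standing in for the Riak Data Types (registers, a set, a counter, and a flag). The field names below are assumptions for illustration, not a required schema:

```python
# Illustrative model of the user "map" described above, mirroring
# Riak Data Types with plain Python: strings for registers, a set,
# an integer for the counter, and a boolean for the flag.

def new_user(first, last, interests=(), paid=False):
    return {
        "first_name_register": first,
        "last_name_register": last,
        "interests_set": set(interests),
        "visits_counter": 0,
        "paid_account_flag": paid,
    }

def record_visit(user):
    user["visits_counter"] += 1  # counters support increment/decrement

user = new_user("Ada", "Lovelace", interests=["math"], paid=True)
record_visit(user)
user["interests_set"].add("computing")

print(user["visits_counter"])         # 1
print(sorted(user["interests_set"]))  # ['computing', 'math']
```

The advantage of the real Data Types over a JSON blob is that each field carries its own merge semantics, so concurrent updates to, say, the counter and the set converge without application-side conflict resolution.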
One of the best ways to enable application interaction with objects (a key/value pair) in Riak is to provide structured bucket and key names for the objects. This approach often involves wrapping information about the object in the object’s location data itself.
For example, appending a timestamp, UUID, or geographic coordinate to a key’s name allows for fine-grained queryability via simple lookups to locate and retrieve a specific set of information. Leveraging the same naming mechanism created for users (UniqueID as the key) enables, in a separate sessions bucket, storing the UUID appended with a timestamp as the key and the session data (in binary format) as the object. In this way, using the same UUID, I am able to obtain both user and session data stored in different buckets and in different formats.
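The naming scheme just described can be sketched as two small key-construction helpers. The bucket names and the underscore separator are illustrative conventions, not anything Riak mandates:

```python
# Sketch of the structured key-naming scheme above: the user's UUID
# keys the user object in a "users" bucket, and the UUID appended
# with a timestamp keys session data in a "sessions" bucket.

def user_key(user_id):
    """Location of the user object: (bucket, key)."""
    return ("users", user_id)

def session_key(user_id, timestamp):
    """Location of one session object for that same user."""
    return ("sessions", f"{user_id}_{timestamp}")

uid = "user-uuid-1234"
print(user_key(uid))                             # ('users', 'user-uuid-1234')
print(session_key(uid, "2015-01-22T10:00:00Z"))  # ('sessions', 'user-uuid-1234_2015-01-22T10:00:00Z')
```

Because both locations are derived from the same UUID, the application can fetch the user record and any of its sessions with direct key lookups, with no join or query required.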
For additional information, and more complex considerations such as modeling relationships and advanced social applications, see the Riak documentation on use cases and data modeling.
Resolving Data Conflicts
In any system that replicates data, conflicts can arise – e.g., if two clients update the same object at the exact same time or if not all updates have yet reached hardware that is experiencing lag. Riak is “eventually consistent” – while data is always available, not all replicas may have the most recent update at the same time, causing brief periods (generally on the order of milliseconds) of inconsistency while all state changes are synchronized.
However, Riak does provide features to detect and help resolve the statistically small number of incidents when data conflicts occur. When a read request is performed, Riak looks up all replicas for that object. By default, Riak will return the most updated version, determined by looking at the object’s vector clock. Vector clocks are metadata attached to each replica when it is created. They are extended each time a replica is updated to keep track of versions. Clients can also be allowed to resolve conflicts themselves.
Further, when an outdated object is discovered as part of a read request, Riak will automatically update the out-of-sync replica to make it consistent. Read Repair, a self-healing property of the database, will even update a replica that returns a “not_found” in the event that a node loses it due to physical failure.
Riak also features “Active Anti-Entropy,” which is an automatic self-healing property that runs in the background. Rather than waiting for a read request to trigger a replica repair (as with Read Repair), Active Anti-Entropy constantly uses a hash tree exchange to compare replicas of objects and automatically repairs or updates any that are divergent, missing, or corrupt. This can be beneficial for large clusters storing “stale” data.
More information on vector clocks, dotted version vectors, and conflict resolution can be found in the online documentation in the section regarding Causal Context.
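The version-comparison idea behind vector clocks can be sketched in a few lines. This is a simplified model for intuition only, not Riak's internal representation: a clock A "descends" clock B if A has seen every update B has; if neither descends the other, the replicas are concurrent and become siblings:

```python
# Simplified vector-clock comparison: each clock maps an actor (e.g.
# a client or node id) to a count of updates it has made.

def descends(a, b):
    """True if clock a has seen everything clock b has."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def resolve(clock1, clock2):
    """Decide which replica is newer, or flag them as siblings."""
    if descends(clock1, clock2):
        return "first wins"
    if descends(clock2, clock1):
        return "second wins"
    return "concurrent (siblings)"

# A strictly newer clock wins...
print(resolve({"nodeA": 2, "nodeB": 1}, {"nodeA": 1, "nodeB": 1}))  # first wins
# ...but independent updates are concurrent.
print(resolve({"nodeA": 2}, {"nodeB": 1}))  # concurrent (siblings)
```

Read repair and Active Anti-Entropy both rely on this kind of comparison: a replica whose clock is descended by another's is out of date and can be safely overwritten.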
Multi-site replication is quickly becoming critical for many of today’s platforms and applications. Not only does replication across multiple clusters provide geographic data locality – the ability to serve global traffic at low-latencies – it can also be an integral part of a disaster recovery or backup strategy. Other teams may use multi-site replication to maintain secondary data stores, both for failover as well as for performing intensive computation without disrupting production load. Multi-site replication is included in Basho’s commercial extension to Riak, Riak Enterprise, which also includes 24/7 support.
Multi-site replication in Riak works differently than the typical approach seen in the relational world, multi-master replication. In Riak’s multi-datacenter replication, one cluster acts as a “primary cluster.” The primary cluster handles replication requests from one or more “secondary clusters” (generally located in datacenters in other regions or countries). If the datacenter with the primary cluster goes down, a secondary cluster can take over as the primary cluster. In this sense, Riak’s multi-datacenter capabilities are “masterless.”
In multi-datacenter replication, there are two primary modes of operation: full sync and real-time. In full sync mode, a complete synchronization occurs between primary and secondary cluster(s). In real-time mode, continual, incremental synchronization occurs – replication is triggered by new updates. Full sync is performed upon initial connection of a secondary cluster, and then periodically (by default, every 6 hours). Full sync is also triggered if the TCP connection between primary and secondary clusters is severed and then recovered.
Data transfer is unidirectional (primary->secondary). However, bidirectional synchronization can be achieved by configuring a pair of connections between clusters.
Full documentation for multi-datacenter replication in Riak Enterprise is available in the online documentation.
Modeling data in any non-relational solution requires a different way of thinking about the data itself. Rather than an assumption that all data cleanly fits into a structure of rows and columns, the data domain can be overlaid on the core Key/Value store (Riak) in a variety of ways. There are, however, distinct tradeoffs and benefits to understand.
Relational databases have:
- Foreign keys and constraints
- Sophisticated query planners
- Declarative query language (SQL)

Riak, in contrast, has:
- A Key/Value model where the value is any unstructured data
- More data redundancy that provides better availability
- Eventual consistency
- Simplified query capabilities
- Riak Search
What you will gain:
- More flexible, fluid designs
- More natural data representations
- Scaling without pain
- Reduced operational complexity
For more information on Data Modeling, or to chat with a member of the Basho team on the topic, please request a Tech Talk.
New executive team secured additional funding and drove strong bookings growth
BELLEVUE, Wash. – January 13, 2015 – Basho Technologies, the creator and developer of Riak®, the industry leading distributed NoSQL database, today announced record 2014 sales growth along with closure of a $25 million Series G funding round led by existing investor Georgetown Partners. The financing is being used to expand development and marketing activities.
The company achieved several critical milestones in 2014:
- Grew bookings 62 percent sequentially in Q3 and 116 percent sequentially in Q4
- Grew bookings 88 percent from second half 2013 to second half 2014
- Ended 2014 with 87 percent licensing, 13 percent professional services revenues
- Closed numerous multi-million dollar enterprise deals
- Shipped Riak 2.0
- Shipped Riak CS 1.5
- Replaced Oracle at the UK’s National Health Service
“The new Basho management team has made strong progress in positioning the company to capitalize on growth opportunities for solutions that enable enterprises to extract value from the massive amounts of data they generate,” said Chester Davenport, chairman of Basho Technologies and managing director of Georgetown Partners. “Riak and Riak CS software have extremely strong product roadmaps for 2015 and sales momentum is impressive. With Series G funding secured, I have confidence Basho will establish itself as a leading unstructured data solutions provider in 2015.”
In March, Basho announced Adam Wray, formerly CEO of Tier 3, as CEO and Dave McCrory, formerly of Warner Music Group and VMware, as CTO. The company also added executive leadership for product, engineering, finance and EMEA management.
Basho has been widely recognized for innovation in distributed systems since being founded in 2008 and Riak has been deployed by more than 30 percent of the Fortune 50. The company experienced a significant increase in enterprise adoption in 2014 in a variety of industries, including advertising, financial services, gaming, retail and healthcare, replacing Oracle at the United Kingdom’s National Health Service.
The Weather Company, which oversees popular brands such as The Weather Channel, weather.com, Weather Underground, Weather Central and WSI, initially selected Riak Enterprise for its Multi-Datacenter Replication capabilities and because it remains extremely lightweight, easy to use and simple to operate.
“The amount of data we collect from satellites, radars, forecast models, users and weather stations worldwide is over 20TB each day and growing quickly. This data helps us deliver the world’s most accurate weather forecast as well as deliver more severe weather alerts than anyone else, so it is absolutely mission critical and has to be available all of the time,” said Bryson Koehler, executive vice president and CITO for The Weather Company. “Riak Enterprise Software gives us the flexibility and reliability that we depend on to enable over 100,000 transactions a second with sub 20ms latency on a global basis.”
Tapjoy first deployed Riak software to guarantee performance and uptime, even with peak traffic. It found that Riak helped them keep costs down, decrease engineering complexity, and reduce operational effort due to its ease of use and general stability.
“Two years ago, we implemented Riak Enterprise Software due to its high availability, operational simplicity, and ability to scale,” said Wes Jossey, head of operations at Tapjoy. “When we began, our clusters typically moved around 40,000 operations per second at peak. Today, we now see peaks well over 250,000 operations per second, all while sustaining sub-millisecond response times and rock solid stability. Despite this massive change in growth, we still do not employ any full-time engineers to work on our Riak cluster. It’s really that easy to use.”
OpenX became a Basho customer in 2012 to address multi-datacenter replication and to consolidate the number of databases it was using to support its ad trafficking system. Riak Enterprise Software meets OpenX’s high availability and scale objectives, with its multi-datacenter replication serving over a billion daily real-time ad requests from a global audience.
“Basho has established themselves as a key OpenX partner,” said Matt Davis, site reliability engineer at OpenX. “They have worked with us in true partnership fashion to keep up with our rapidly scaling business and have always addressed our concerns in a timely manner. As a supporter of both Riak software and the greater Erlang community, OpenX appreciates the strong engineering prowess at Basho.”
“Worldwide demand for NoSQL technologies is driving our growth and greatly expanding our large enterprise deployments,” said Wray. “As NoSQL moves into primetime, we’re seeing more enterprises seek solutions that address a broad range of unstructured data requirements and we expect this trend to increase rapidly. We set aggressive product and sales goals for 2014 and I couldn’t be happier with our achievements this year. We look forward to continuing this acceleration into 2015 and beyond.”
- Basho Website (http://basho.com)
- Basho Blog (http://basho.com/blog/)
- Riak® software (http://basho.com/riak/)
- Riak® CS software (http://basho.com/riak-cloud-storage/)
- Additional Resources (http://basho.com/resources/)
- Twitter: @Basho (https://twitter.com/basho)
- LinkedIn (https://www.linkedin.com/company/basho-technologies-inc)
About Basho Technologies
Basho is a distributed systems company dedicated to making software that is highly available, fault-tolerant and easy-to-operate at scale. Basho’s distributed database, Riak®, the industry leading distributed NoSQL database, and Basho’s cloud storage software, Riak® CS, are used by fast growing Web businesses and by one third of the Fortune 50 to power their critical Web, mobile and social applications, and their public and private cloud platforms.
Riak is the registered trademark of Basho Technologies, Inc.
Riak software and Riak CS software are available open source. Riak Enterprise Software and Riak CS Enterprise Software offer enhanced multi-datacenter replication and 24×7 Basho support. For more information, visit basho.com.
# # #
January 6, 2015
If you have read about Riak, or seen a member of the Basho team present, you have probably heard the phrase “Your data is opaque to Riak.” While this is not strictly true with the inclusion of distributed Data Types in Riak 2.0, it was a phrase that hinted at the core structure of Riak itself.
Riak is a Key Value data store.
In a relational database, data is organized by tables that are separate and unique structures. Within these tables exist rows of data organized into columns. As such, interaction with the database is by retrieving or updating entire tables, individual rows, or a group of columns within a set of rows.
In contrast, Riak has a much simpler data model. An Object is both the largest and smallest element of data. As such, interaction with the database is by retrieving or modifying the entire object. There is no partial fetch or update of the data.
Keys in Riak are simply binary values (or strings) used to identify Objects. The Key/Value pair (an Object) is stored in a higher-level namespace called a Bucket. And, with Riak 2.0, there is an extra layer of abstraction known as Bucket Types.
This Key/Value/Bucket model enables broad flexibility in modeling the application’s data domain with Riak as the data store for persistence.
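The addressing model above can be sketched as a simple tuple: with Riak 2.0, an object effectively lives at (bucket type, bucket, key). This is an illustrative in-memory model of the namespace layering, not Riak's API; the "default" type mirrors Riak's behavior when no bucket type is set:

```python
# Sketch of Riak 2.0 object addressing: bucket type -> bucket -> key.

def location(bucket, key, bucket_type="default"):
    """Full address of an object; untyped buckets use the default type."""
    return (bucket_type, bucket, key)

store = {}
store[location("users", "alice")] = b'{"name": "Alice"}'
store[location("users", "alice", bucket_type="maps")] = b"<map data>"

# The same bucket/key under different bucket types are distinct objects.
print(len(store))  # 2
```

Bucket types matter in practice because properties such as the data type of a bucket, or whether it is strongly consistent, are configured at the type level and inherited by every bucket of that type.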
Another NoSQL model that many are familiar with is the document store. Unlike the Key/Value model, the data store is aware of the structure of the objects stored. These objects, or documents, are grouped into “collections” — each analogous to a relational “table” — and the datastore provides a query mechanism to search collections for objects with particular attributes. When the data that is being persisted is easily rendered as a JSON document, a document store can seem a natural fit. Some common use cases include product catalog data and content management systems.
The Basho Docs have a lengthy tutorial entitled Using Riak as a Document Store that walks you through the process of leveraging Riak as a document store for a CMS. There are many approaches to modeling, but the tutorial demonstrates the power of Riak 2.0 features by combining the maps data type and indexing that data with Riak Search.
When the data you are persisting can be represented as JSON, and you require the ability to query the data, Riak 2.0 is an excellent solution for persisting and modeling document data. The flexibility of the Key/Value model, combined with the power of Riak Search and Riak Data Types, provide you with a highly scalable, highly available document store with rich, full-text query capabilities. In addition, the inclusion of the maps data type means that you don’t have to write complex client side resolution logic when faced with network partitions. Riak Data Types handle that conflict resolution automatically.
A scalable, available document store that is operationally simple may seem compelling enough to use Riak. But when you combine the characteristics of Riak with the multi-datacenter replication capabilities of Riak Enterprise, now you have a solution that enables you to bring your data operations closer to the end user.
Scalable, available, operationally simple, and replicated. That’s the power of using Riak as a document store.
December 18, 2014
One of the interesting things about attending industry events, like AWS re:Invent, is identifying common trends that arise in conversations. Recent conversations point to a renewed interest in “enterprise ready replication” for NoSQL databases.
As business data continues to grow, there is an entirely new set of challenges that are presented related to availability, scalability, and fault-tolerance. While most NoSQL databases work at small scale, availability is often compromised as applications reach full production or peak capacity. Having the right replication functionality is key to ensuring that availability requirements are not compromised as your system grows.
“Replication” may mean different things based on context. In this case, we are referring to the movement of data in a database cluster — or across datacenters — for the purpose of redundancy or data locality. If your database experience began in an RDBMS context, then replication implies a specific contextual understanding of multi-master transactional deployment and, perhaps, shipping transaction logs between incremental backups in a hot/warm database configuration. In contrast, for those who began in the NoSQL era, the term may evoke images of replica-sets on a sharded infrastructure and the operational overhead associated therewith.
In a distributed NoSQL database, like Riak, the term replication is used to encompass two distinct concepts. First, intra-cluster replication for high availability and fault tolerance within the datacenter; and second, multi-datacenter replication for data locality and global availability. There is none of the complexity of log shipping or dealing with a sharded infrastructure.
Data replication is a core feature of Riak’s basic architecture. Riak was designed to operate as a clustered system containing multiple nodes (commodity servers or cloud instances). The replication implementation allows data to live on multiple machines at once, with a single write request, in case a node in the cluster goes down or is unavailable due to issues like network partitioning.
Intra-cluster replication is fundamental and automatic in Riak, so that your data is always available. All data stored in Riak is replicated to a number of nodes in the cluster according to a configurable parameter (n_val) set in a bucket’s bucket type. With the default n_val setting of 3, there are always three copies of all data, stored on three different partitions/vnodes. A detailed explanation and analysis of this replication capability is discussed in the Riak documentation – Understanding replication by example.
In the case of intra-cluster replication, or what we would refer to simply as “replication”, data distribution ensures redundant data such that high availability is maintained in a failure state.
In contrast to intra-cluster replication, multi-datacenter replication (a feature of Riak Enterprise) is a critical part of modern application infrastructures. Riak Enterprise offers multi-datacenter replication features so that data stored in Riak can be replicated to multiple sites (vs. multiple servers in the same site).
As we are all aware, understanding application latency (for an end user) begins with the realization that data can’t travel faster than the speed of light. So, inherently, as source information moves further from its consumption, latency is introduced. As such, there is a set amount of latency for a customer connecting to your application hosted in New York when they are accessing the application from San Francisco. This latency profile increases, and becomes more complex, as the geographic distribution of your customer base increases.
With multi-datacenter replication in Riak Enterprise, data can be replicated across locations and geographic areas providing for disaster recovery, data locality, compliance with regulatory requirements, the ability to “burst” peak loads into public cloud infrastructure, amongst others.
Riak’s multi-datacenter replication is masterless. One cluster acts as a primary, or source, cluster. The primary cluster handles replication requests from one or more secondary, or sink, clusters (generally located in datacenters in other regions or countries). If the datacenter with the primary cluster goes down, a secondary cluster can automatically take over as the primary cluster.
More architectural strategies for multi-datacenter implementations are covered in the Basho whitepaper entitled Riak Enterprise: Multi-Datacenter Replication – A Technical Overview & Use Cases or in the Basho Documentation section Multi-Datacenter Replication: v3 Architecture.
Replication, inside a cluster, is a core design tenet of Riak. This is used to provide the availability and fault-tolerance characteristics — with a low operational overhead — that many unstructured data workloads demand.
Multi-datacenter replication, while related, is an entirely different approach and architecture to enable the geographic distribution of data to solve for redundancy, geo-data locality, etc.
Replication is an important scalability feature of any database deployment. Ensuring that your NoSQL database replicates data in a way that is scalable, operationally simple and achieves your business objectives is key to your success.
For years, the press and industry analysts have been telling us that cloud is mainstream, but the reality is that Enterprises must shift their workloads to the cloud in an orderly, low risk manner. While there are many applications already built and running in the cloud, there are many new (or underutilized and, perhaps, misunderstood) technologies like Docker, Chef and object storage that are changing the way cloud applications are implemented.
At RICON 2014, Basho worked with Citrix to host “Build a Cloud Day.” Build a Cloud Day sessions explore new technologies and show how to bring some order to the chaos of moving workloads to the cloud. The attendees learn the concepts and best practices to deploy a cloud computing environment using Apache CloudStack and other cloud infrastructure tools, including those from XenServer, Docker, Riak CS, Chef, Zenoss, Puppet and many others that automate server and network configuration for building highly available cloud computing environments.
Cloud Architecture: Virtualization, Orchestration and Storage
“Build a Cloud Day” started with an excellent presentation by Mark Hinkle. Many of us know him as @mrhinkle. Mark is the Senior Director of Open Source Solutions at Citrix Systems where he helps support the Apache CloudStack and Xen.org projects.
Mark has an excellent grasp of cloud computing and provides an overview of cloud computing architecture and the open source software that can be used to deploy and manage a cloud-computing environment. He looks at virtualization and containers and provides a brief description of Docker and how it is being used in today’s applications.
He also provides an overview of OpenStack. Mark closes the presentation with insights into how to deliver Platform-as-a-Service (PaaS) and what technologies can be used to complement this evolving cloud computing paradigm.
Software is Eating Infrastructure
Other presenters at “Build a Cloud Day” included Basho’s own John Burwell (@john_burwell). John is a Senior Software Engineer at Basho Technologies. He also serves as an Apache CloudStack PMC member and committer focused on storage architecture and security integration. John’s talk explores cloud design strategies to achieve high availability and reliability using commodity components and how to apply these strategies using Apache CloudStack and Riak CS.
By migrating reliability and scalability responsibilities up the stack from specialized hardware to software, cloud orchestration platforms such as Apache CloudStack (ACS) and object stores such as Riak CS increase the utilization and density of compute and storage resources by dynamically shifting workloads based on demand.
John describes two workloads predominately managed in cloud environments — traditional virtualization and cloud — and how to use Apache CloudStack to efficiently manage both simultaneously. He then explores storage design to support this dual workload model, including the use of Riak CS with Apache CloudStack to reduce infrastructure costs without sacrificing reliability.
Riak CS provides software-defined, fault-tolerant object storage uniquely built to handle a variety of unstructured and big data needs using commodity hardware.
Apache CloudStack, Apache Brooklyn and more…
There were many great presentations at “Build a Cloud Day” including:
- Primary Storage in CloudStack by Mike Tutkowski (Slides | Video)
- Introduction to Apache CloudStack by David Nalley (Slides | Video)
- Hypervisor Selection in the Cloud by Tim Mackey (Slides | Video)
- Cloud Application Blueprints with Apache Brooklyn by Alex Henevald (Slides | Video). Alex also did a Riak-specific presentation at RICON 2014, Running Riak in a Docker Cloud using Apache Brooklyn.
You can find out more about RICON 2014 in our blog post: http://basho.com/wrapping-up-ricon-2014/.
The videos of the presentations at RICON 2014 can be found on our RICON Archive site. The Keynote by Peter Alvaro – Outwards from the Middle of the Maze is very popular.
November 10, 2014
Many data needs are better served by data stores that are optimized for maximum availability and scalability — rather than optimized for consistency. For certain use cases, there are elements to the data that require strong consistency. With Riak 2.0, in addition to eventual consistency, there is now a way to enforce strong consistency when needed.
NOTE: Riak’s strong consistency feature is currently an open-source-only feature and is not yet commercially supported.
Behavioral Changes with Strong Consistency
Strongly consistent operations in Riak function much like eventually consistent operations at the application level. The core difference lies in the types of errors Riak will report to the client.
Each request to update an object (except for the initial creation) must include a context value reflecting the last time the application read it. This is the same behavior that Riak clients have always followed with version vectors, and strong consistency also mandates its use. Similarly, reading data from a strongly consistent Riak bucket functions exactly like eventually consistent reads.
If that value is not provided for an update operation to an existing object, Riak will reject it. This is because the database assumes that you have not seen the current value and may not know what you’re doing.
Similarly, if that context value is out of date, Riak will also reject update operations. The client must re-read the latest value and supply an update based on that new value, with the new context.
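The accept/reject rule described above behaves like a compare-and-set loop, which can be sketched with an in-memory model. This is not Riak's API; the version-numbered store and exception below are illustrative assumptions:

```python
# Toy compare-and-set model of strongly consistent updates: an update
# must carry the context (version) it read, and a stale or missing
# context is rejected so the client must re-read and retry.

class StaleContext(Exception):
    pass

store = {}  # key -> (version, value); version 0 means "not created"

def get(key):
    return store.get(key, (0, None))

def put(key, value, context):
    version, _ = get(key)
    if context != version:
        raise StaleContext(f"expected context {version}, got {context}")
    store[key] = (version + 1, value)

ctx, _ = get("alice/password")
put("alice/password", "hash1", ctx)      # initial create succeeds
try:
    put("alice/password", "hash2", ctx)  # stale context: rejected
except StaleContext:
    ctx, _ = get("alice/password")       # re-read the latest value...
    put("alice/password", "hash2", ctx)  # ...then retry with its context

print(get("alice/password")[1])  # hash2
```

The retry branch is the important part: rather than silently overwriting, a strongly consistent store forces the client to observe the current value before its update can win.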
If Riak cannot contact a majority of the servers responsible for the key, the request will fail. Ordinarily, Riak is happy to accept all operations in the interest of high availability and never dropping a write – even in the extreme case of only one server surviving data center outages.
Strong consistency also eliminates object siblings, as it is effectively impossible for the cluster to disagree on the value of an object.
When considering consistency models in an application, it is easy for the logic to quickly become daunting. This is especially true when designing a workflow that leverages both eventually and strongly consistent models. It is, therefore, easiest to begin with a simple use case.
Consider the workflow involved in storing and updating username and password data. In the case of a password update, it is necessary that — at any given time — there be exactly ONE result for a user's password. Relatedly, it is important that an update of this value be fully atomic, or the user experience is substantially degraded. It would be possible to leverage Riak for all the eventually consistent elements of the application and reserve strong consistency for the username and password.
To see how eventual and strong consistency can be combined to solve business problems, let’s take a not-so-hypothetical example from the energy industry.
Imagine you’re collecting massive amounts of geological data for analysis. Each batch of data must be processed by a single instance of your application. Since this processing can take hours, days, or even weeks to complete, it’s expensive if two applications handle the same batch.
Let’s walk through the sequence of events.
- Batch of data arrives for processing.
- The batch is stored in a large object store (like Riak CS) under a batch ID.
- The batch ID is added to a pending job list in Riak and stored as a set (one of the new Riak Data Types).
This is a classic example of eventual consistency and an illustration of the value of the new Riak Data Types introduced with Riak 2.0. Storing a new batch ID in your database should never fail, even if servers are offline. If multiple applications are adding batch IDs to the pending list at the same time, it’s perfectly reasonable for those lists to temporarily diverge, as long as they can be trivially merged later.
Let’s continue to see where strong consistency comes into play.
- A compute node becomes available to process the data.
- The compute node retrieves the pending job list and picks a batch ID.
- The compute node attempts to create a lock for that batch ID.
This is where strong consistency is required. The lock object should be created in a bucket managed by the new strong consistency subsystem in Riak 2.0. If another compute node grabs the same batch ID and tries to create a second lock object, Riak's strong consistency logic will reject that second attempt. The losing compute node simply starts over by picking a new batch ID.
To detect crashed jobs, the lock object should be created with basic job data, such as which compute node owns the processing job and what time it was started.
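The lock step can be sketched as "create-if-absent": in a strongly consistent bucket, creating a given key succeeds for exactly one writer, and every other attempt is rejected. The class and method names below are illustrative stand-ins, not the Riak client API.

```python
import time

class LockTaken(Exception):
    """Raised when the lock for a batch ID already exists."""

class LockBucket:
    """Hypothetical sketch of a strongly consistent lock bucket."""

    def __init__(self):
        self._locks = {}

    def acquire(self, batch_id, owner):
        if batch_id in self._locks:
            # A second create of the same key is rejected; the caller
            # goes back to the pending list and picks another batch ID.
            raise LockTaken(batch_id)
        # Store basic job data so crashed jobs can be detected later.
        self._locks[batch_id] = {"owner": owner, "started_at": time.time()}
        return self._locks[batch_id]

locks = LockBucket()
locks.acquire("batch-007", owner="compute-node-1")   # first claim wins
try:
    locks.acquire("batch-007", owner="compute-node-2")
except LockTaken:
    print("batch-007 already claimed; picking another batch ID")
```

Recording the owner and start time in the lock is what enables the crash detection mentioned above: a monitor can flag locks whose jobs have run far past their expected duration.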
- The compute node asks Riak to add the batch ID to a different set, a running job list.
- The compute node asks Riak to remove the batch ID from the pending list.
- The job runs.
- When completed, the compute node asks Riak to add the batch ID to a completed job list.
- Riak is asked to remove the batch ID from the running list.
- The compute node deletes the lock object (or updates it to reflect the completion of the processing job).
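The whole sequence above can be condensed into one single-threaded sketch, with plain Python sets standing in for the Riak data types and a dict for the strongly consistent lock bucket. All names are illustrative; only the ordering of steps mirrors the workflow.

```python
pending, running, completed, locks = {"batch-42"}, set(), set(), {}

def process(batch_id, node):
    """Claim and run one batch; return False if it cannot be claimed."""
    if batch_id not in pending or batch_id in locks:
        return False                    # already claimed or already done
    locks[batch_id] = {"owner": node}   # strongly consistent lock: one winner
    running.add(batch_id)               # add to the running job list
    pending.discard(batch_id)           # remove from the pending list
    # ... the job itself runs here, possibly for hours or days ...
    completed.add(batch_id)             # record completion
    running.discard(batch_id)           # remove from the running list
    del locks[batch_id]                 # release (or update) the lock
    return True

print(process("batch-42", "node-a"))  # True: node-a claims and runs the batch
print(process("batch-42", "node-b"))  # False: nothing left to claim
```

Note that only the lock step needs strong consistency; the pending, running, and completed lists tolerate temporary divergence because they merge cleanly, which is exactly the division of labor described above.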
Tradeoffs When Using Strong Consistency
- Blind updates will be rejected, so the client must read the existing value before supplying a new one (except in the case of entirely new keys).
- Write requests may be slightly slower due to coordination overhead.
- If a majority of the servers responsible for a piece of data are unavailable, write requests will fail. Read operations may fail depending on the freshness of the data that is still accessible.
- Secondary indexes (2i) are not yet supported.
- Multi-datacenter replication in Riak Enterprise is not yet supported.
To learn more, see our documentation:
- Using Strong Consistency (for developers)
- Managing Strong Consistency (for operators)
- Strong Consistency (theory & concepts)
Strong Consistency is now available with Riak 2.0. Download Riak 2.0 on our Docs Page.
By: Peter Coppola
We had the opportunity to stop by DATAVERSITY's NoSQL Now! conference in San Jose last week. I was very impressed with the level of energy and the wide-ranging selection of sessions. According to Tony Shaw, CEO of DATAVERSITY (the conference organizer), registrations were up 15 percent from 2013.
The exhibition hall was packed and lively as attendees jostled between booths. DATAVERSITY did an outstanding job keeping the show floor tightly packed with exhibitors. The industry was well represented by Cloudera (sporting "Data is the new bacon" t-shirts), MarkLogic, MongoDB, Oracle, and EnterpriseDB – all present as major sponsors. Between conversations, I was able to nab a nifty, versatile screwdriver disguised as a pen from DataStax.
NoSQL Now sessions do rely heavily on sponsors, but with such a wide selection of tracks there’s bound to be a topic of interest at any given time slot. I had a choice of the following concurrent sessions at 4:15 p.m. on Wednesday:
- Internet of Things with MongoDB – MongoDB
- Out with MapReduce, In with Spark – DataStax
- Case Studies in Search and Semantics – MarkLogic
- Just the Right Weather for our Company: How We Chose Our Data Stores – The Weather Company
- NoSQL on ACID – EnterpriseDB
I attended The Weather Company's session – not only was it the sole non-vendor presentation, but the company is also a customer and big fan of Riak. The Weather Company manages five data centers that in production handle 25,000 requests per second and distribute 60 GB of data to each data center every 10 minutes. Surya Kangeyan Sivakumar took us through how The Weather Company selected its data store solutions, and how it overcame the mindset of having to use its existing relational database simply because so much had been invested in it. Riak was selected, along with other NoSQL solutions, for the speed and ease with which it could be stood up.
In 2015, Basho looks forward to being a more active participant in NoSQL Now.
Distributed cloud storage software adds additional Amazon S3 compatibility, performance improvements, simplified admin and increased scalability
CAMBRIDGE, Mass. – August 5, 2014 – Basho, the creator and developer of Riak, the industry-leading distributed NoSQL database, today introduced Riak CS 1.5 and Riak CS 1.5 Enterprise, Basho's distributed object storage software. Riak CS (Cloud Storage) is open source software built on top of Riak, used to build public or private clouds, or as reliable storage to power applications and services. Riak CS 1.5 delivers new features that improve operations, performance, and scalability. Basho continues to offer enterprise-class features in Riak CS Enterprise, which includes multi-datacenter replication, world-class 24×7 support, and a flexible pricing model.
Companies dealing with large amounts of unstructured data like videos, images, and documents are adopting cloud object storage so that data remains highly available through a seamlessly scalable architecture. Businesses in industries such as broadcasting and telecommunications rely on the stability, integration functionality, and performance of Riak CS to efficiently store, organize, and access data while keeping it simple to manage.
“We offer our customers affordable and scalable cloud storage solutions built on Basho’s Riak CS,” said Makoto Oya, vice director of IDC Frontier. “The enhanced Amazon S3 compatibility and ability to scale well into the multi-petabyte level in Riak CS 1.5 will help us better support the rapid growth we are seeing in our storage business.”
I-NET Corp, a data processing service headquartered in Japan, uses Riak CS for its cloud service called Dream Cloud® and is looking to achieve further cost efficiency thanks to the increased scalability capabilities in Riak CS 1.5.
“Cloud-based object storage is ideal for storing our customer’s growing business-critical data, and we have relied on the excellent performance, cost efficiency and high reliability of Riak CS for the I-NET Dream Cloud®,” said Tsutomu Taguchi, senior managing director, business group of I-NET Corp. “Riak CS already provides us with high availability and now that Riak CS is further optimized to scale, we believe that Riak CS 1.5 delivered by Basho will drive even higher adoption of Dream Cloud®.”
New features enhance performance for object storage to store increasing amounts of data worldwide
Basho delivers new functions in Riak CS that include:
- Additional Amazon S3 compatibility: Expanded storage API compatibility with S3 includes features such as multi-object delete, put object copy, and cache control headers for more flexible integration with content delivery networks (CDNs).
- Performance improvement in garbage collection process: Delivered especially for customers with a high rate of object updates and deletes, Riak CS now more quickly reaps objects flagged for garbage collection.
- New, simplified administrative features: New and consolidated admin features make organizational tasks easier for activities such as cluster management, monitoring and troubleshooting.
- Multi-cluster support: Technology preview for increased scalability of Riak CS Enterprise by allowing multiple Riak clusters to reside under a single CS namespace, thereby expanding the maximum capacity of a single cluster.
“Providing the strongest key value solution and object store means responding to customer needs and demands attentively,” said Dave McCrory, CTO of Basho. “With Riak CS 1.5 Enterprise, new features are delivered as requested by our customers. We are committed to making it easier to consume cutting-edge versions of Riak and will continue to do so by taking a more iterative approach to how we release Riak.”
Availability and Pricing
Riak CS 1.5 is available immediately for Debian, Ubuntu, FreeBSD, OS X, Red Hat Enterprise Linux, Fedora, SmartOS and Solaris. To view the latest technical documentation or to download Riak CS, visit docs.basho.com/riakcs/latest/.
Basho delivers customized packages for its commercial software, Riak Enterprise and Riak Enterprise Plus, with health checks, as well as options for project-based Professional Services engagements. Full pricing details of Basho commercial software are at http://basho.com/riak-enterprise/#pricing. To request a trial license of Riak CS Enterprise, prospective inquiries can request a Riak CS Tech Talk at http://info.basho.com/SignUpRiakTechTalk.html.
- Basho Website (http://basho.com)
- Basho Blog (http://basho.com/blog/)
- Riak (http://basho.com/riak/)
- Riak CS (http://basho.com/riak-cloud-storage/)
- Riak CS doc (docs.basho.com/riakcs/latest/)
- Additional Resources (http://basho.com/resources/)
- Twitter: @Basho (https://twitter.com/basho)
- LinkedIn (https://www.linkedin.com/company/basho-technologies-inc)
About Basho Technologies
Basho is a distributed systems company dedicated to making software that is highly available, fault-tolerant, and easy to operate at scale. Basho’s distributed database, Riak, and Basho’s cloud storage software, Riak CS, are used by fast-growing Web businesses and by one third of the Fortune 50 to power their critical Web, mobile, and social applications and their public and private cloud platforms.
Riak and Riak CS are available open source. Riak Enterprise and Riak CS Enterprise offer enhanced multi-datacenter replication and 24×7 Basho support. For more information, visit basho.com. Basho is headquartered in Cambridge, Massachusetts.
April 28, 2014
On Monday, May 5th, Basho is co-hosting an open source happy hour and networking meetup. Partnering with GoGrid, we will bring together some of the greatest minds in open source to discuss current trends and where open source is headed. Drinks and appetizers will be provided. The meetup will take place from 6pm-8pm on May 5th at the 111 Minna Gallery in San Francisco. Space is limited, so be sure to RSVP quickly.
This meetup is also in conjunction with the Open Business Conference. Open Source has become ubiquitous in the enterprise and in the business layer, as more and more organizations are reaping its considerable benefits, including speed, efficiency and cost savings. Open Business Conference 2014 brings together the people who are building and deploying the latest in enabling technologies and solutions and teaches businesses how to put their data to use. Basho will be available to answer any questions about Riak and open source at the GoGrid booth. Open Business Conference takes place May 5-6 at the Palace Hotel in San Francisco.
To learn more about Basho’s partnership with GoGrid (and other partnerships), visit our Partners Page.