Basho is pleased to announce the release of Riak CS 1.5, which provides additional performance enhancements and simplifies administration and development with additional admin tools, enhanced S3 compatibility and a technical preview of an architecture to support clusters with very large amounts of storage. Highlights include:
- riak-cs-admin: Consolidates admin operations into a command line tool.
- riak-cs-stanchion: Changes the Stanchion IP and port.
- riak-cs-debug: Packages log, configuration and operating system command files along with Riak command results.
- syslog: Support for standardized syslog output for log aggregation using third-party tools.
S3 API Features
- multi-object delete: Reduces request overhead by supporting multiple deletes in a single request (up to 1,000 keys per request).
- cache control headers: Method for providing caching instructions in a request header.
- PUT object – copy: Creates a copy of an object that already exists in Riak CS.
A full list of S3 API compatibility can be found on the Basho docs site here.
Increased Scalability (Enterprise Feature)
Partly due to limitations with distributed Erlang, prior to 1.5 scalability, Riak CS was limited to several petabytes. CS 1.5 introduces a technical preview of an architecture that allows multiple Riak clusters to reside under a single CS namespace, thereby significantly increasing the amount of storage possible in a cluster. A production-ready version is planned for later this year, with multi-data center support to follow.
Garbage Collection Improvements
In Riak CS, deleted and updated objects are not removed immediately. Instead, a reference is written to a special bucket and later removed by the garbage collection process at regular intervals. CS 1.5 includes several garbage collection enhancements that will benefit customers with a high rate of object deletion or updates.
- concurrent garbage collection worker processes: Speed up the rate of garbage collection with the addition of multiple workers.
- flexible enforcement of leeway interval: In previous versions, updated and deleted objects are reaped only after they reach a predefined time-based leeway interval, which was set when an object was marked for deletion. In CS 1.5 the leeway interval is managed by the garbage collection daemon and can be changed to remove objects sooner, for example, in emergency situations where maximum storage capacity is reached.
Other Notable Enhancements
- faster bucket listings: Optimizations in the OTP xmerl library enables faster bucket listings, in particular for large buckets.
- setting ACLs upon PUT object: Ability to set ACLs via header at PUT object creation is now fully functional.
Riak CS 1.5 is available at: http://docs.basho.com/riakcs/latest/riakcs-downloads/. A full list of changes is available in the release notes. Watch the blog for a detailed discussion of the multi-cluster work.
March 12, 2014
Druid is an open source analytics platform designed for real-time, exploratory queries on large-scale data sets. Druid is useful for use cases requiring interactive and fast exploration of large amounts of data (10s of billions of events added per day, 10s of TB of data added per day) and always-on availability.
The output of Druid’s indexing process is stored in deep storage. Deep storage provides a durable store for segments that feed the Druid query nodes. As long as Druid nodes can see this storage infrastructure and access the segments stored on it, there will be no data loss no matter how many Druid nodes are lost. For deep storage, Druid supports local mounts, HDFS, and S3 or S3-compatible APIs (like Riak CS). S3-compatible API is the default for deep storage.
For a cost-effective, highly available alternative to S3, Riak CS fits nicely with Druid. Druid has released a guide to walk users through how to use Riak CS as deep storage in Druid. The full guide is available here: https://github.com/metamx/druid/wiki/Stand-Alone-With-Riak-CS
February 19, 2014
Basho is excited to announce that Yahoo! JAPAN has launched its new cloud storage service platform for the enterprise market in Japan, powered by Riak CS. Yahoo! JAPAN is the one of the most comprehensive web portals and the most popular search engine in Japan, surpassing Google.
Riak CS is Basho’s cloud storage software that combines Amazon-class economics with the ability to customize and extend. It is built on Riak, Basho’s distributed database that is highly efficient at storing and retrieving objects, even under extreme usage or failure scenarios. For extensibility, Riak CS adds compatibility with the Amazon S3 and OpenStack Swift API.
Yahoo! JAPAN is pushing hard to be number one in transaction value across the Japanese e-commerce marketplace. LOHACO, a popular internet shopping site operated by ASKUL Corporation, has been using Basho’s Riak CS solution via Yahoo! JAPAN as backend storage for almost a year. Riak CS has proven to be a very stable system for LOHACO’s needs.
“Enterprises require storage services that are cost-effective, reliable, and scalable. Yahoo! JAPAN has been able to create a cloud storage solution to meet these needs, as well as deliver access to our large networks and server infrastructure, by developing on Riak and Riak CS,” said Shingo Saito, Cloud Product Manager at Yahoo! JAPAN. “Our partnership with Basho allows us to continue to build features that are beneficial to major enterprises and constantly improve our Cloud Storage Service by integrating with Riak, Riak CS, and future products from Basho.”
Basho is proud of Yahoo! JAPAN’s cloud service. We are excited to be a part of this great milestone.
October 2, 2013
What Is Riak CS?
In May of this year, we posted the top 5 questions we heard from customers and our community about Riak CS; today we’ll take a deeper dive into the technical details, specifically the differences between Riak CS and Riak itself.
Riak CS as Compared to Riak
Both Riak CS and Riak are, at their core, places to store objects. Both are open source and both are designed to be used in a cluster of servers for availability and scalability.
The fundamental distinction between the two is simple: Riak CS can be used for storing very large objects, into the terabyte size range, while Riak is optimized for fast storage and retrieval of small objects (typically no more than a few megabytes).
There are subtle differences; however, that can be obscured by the similarities between the two.
Why Would I Use Riak CS?
Riak CS is used for a variety of reasons. Some examples:
- Private object storage services, for example for companies that want to store sensitive data behind their own firewalls.
- Large binary object storage as part of a voice or video service.
- An integrated component in an OpenStack cloud solution, storing and serving VM images on demand.
Tier 3, Yahoo! Japan, Datapipe, and Turner Broadcasting are just a few of the big names using Riak CS today.
What Does Riak CS Do That Riak Doesn’t?
Riak CS carves large objects into small chunks of data to be distributed throughout a Riak cluster and, when used with Riak CS Enterprise, synchronized with remote data centers.
Riak CS adds compatibility with Amazon’s S3 and OpenStack’s Swift APIs. These offer very different semantics than Riak, and the advanced search capabilities in Riak such as Secondary Indexes and full text search are not available using S3 or Swift clients.
We strongly advise against it, but it is possible to work with Riak’s standard APIs “under the hood” when deploying a Riak CS solution.
Work is actively underway to add a security model to Riak in the upcoming 2.0 release.
Buckets or Buckets?
Users of Riak CS store their objects in virtual containers (called buckets in Amazon S3 parlance, containers in OpenStack).
Riak also relies heavily on buckets for data storage and configuration but, despite the names, these buckets are not the same.
As an example of how this can cause confusion: the replication factor in Riak (the number of times a piece of data is stored in a cluster) is configurable per-bucket. Because Riak’s buckets do not underly the user buckets in Riak CS, this feature cannot be used to create tiered services.
Riak is designed to maximize availability; the price paid for that is delayed consistency when the network is split and clients are writing to both sides of the cluster.
Creating user accounts in Riak CS; however, led to the need for a mechanism to maintain strong consistency. If two people attempt to create user accounts with the same username on either side of a network partition, both cannot be allowed to succeed, or else a conflict will occur that is very difficult to automatically recover from.
Furthermore, user buckets in S3 (and OpenStack APIs as implemented in Riak CS) reside in a global rather than a user-specific namespace, so bucket creation must also be handled carefully.
Riak CS introduced a service named Stanchion that is designed to handle these specific requests to avoid conflicts. Stanchion is a single process running on a single Riak server (thus introducing a single point of failure for user account and bucket creation requests).
While it is possible to deploy Stanchion using common system tools to make a daemon process run in a highly available manner, Basho recommends doing so carefully and testing it thoroughly. Since the only impact of failure is to prevent user and bucket creation, it may be preferable to monitor and alert on failure. If two copies of Stanchion are running due to a network partition, its strong consistency guarantees will be lost.
With strong consistency options targeted for Riak 2.0, expect to see some changes.
Basho offers multi-datacenter replication with its Enterprise software licenses, and Riak CS Enterprise takes full advantage of that feature. Data can be written to one or more clusters in multiple data centers and be synchronized automatically between them.
There are two types of synchronization: real-time, which occurs as objects are written, and full sync, which happens on a periodic basis to compare the full contents of each cluster for any changes to be merged.
One key difference is that Riak CS maintains manifest files to track the chunks it creates, and it is these manifests that are distributed between clusters during real-time sync. The individual chunks are not synchronized until a full sync replication occurs, or until someone requests the file from a remote cluster. The manifest is made active for someone to retrieve the chunks after the original upload to the source cluster is complete.
A common mistake while installing Riak CS is to configure it using information specific to Riak rather than Riak CS. As an example, per the Riak CS installation instructions the relevant backend data store must be configured to
riak_cs_kv_multi_backend, which is forked from Riak’s
riak_kv_multi_backend. Using the latter will cause problems.
Riak (CS) Control
Exposure to Internet
Exposing any database directly to the Internet is risky. Riak, currently lacking any concept of authentication, absolutely must not be accessible to untrusted networks.
Riak CS; however, is designed with Internet access in mind. It is still advisable to place a load balancer or proxy in front of a Riak CS cluster, for example to ease cluster maintenance/upgrades and to provide a central location to log and block potentially hostile access.
Riak CS servers will still have open Riak ports that must be protected from the Internet as you would any Riak servers.
Where to Next for Riak CS?
2013 has been a big year for Riak CS: it was released as open source in the spring, with OpenStack support added this summer. Still, there is much to do.
As mentioned above, improving or replacing Stanchion is a high priority.
We will continue to expand the API coverage for Riak CS. The next major targets are the copy object operations that Amazon S3 and OpenStack Swift offer.
Compression and more granular replication controls are also under consideration for future releases.
By building Riak CS atop the most robust open source distributed database in the world, we’ve created a very operationally friendly, powerful storage solution that can evolve to meet present and future needs. Feel free to give it a try if you aren’t already using it.
If you’re interested in hearing from the engineers who’ve made this software possible (and seeing just how far a highly available data storage solution can take you), join us October 29-30th for RICON West. RICON West is where Basho brings together industry and academia to discuss the rapidly expanding world of distributed systems, including Riak and Riak CS.
September 24, 2013
Basho would like to congratulate Citrix on their launch of CloudPlatform 4.2
This release introduces the first integration of S3-compatible object storage into the CloudPlatform infrastructure, which provides the foundation necessary to enable a tighter integration between CloudPlatform and Riak CS. Cloud-era workloads that require object storage can easily run on CloudPlatform and have transparent access to storage across geographic and logically defined locations.
For mutual customers of Basho and Citrix, this integration can be leveraged to provide either for secondary storage for Infrastructure-as-a-Service (IaaS) components or as a method of providing Storage-as-a-service (StaaS) to their customers.
Basho remains committed to offering a holistic integration for CloudPlatform users who require the scalability, availability, and ease of operation offered by Riak CS.
For more information on the Basho and Citrix partnership (including a video recorded at a meetup), please review:
August 15, 2013
With the launch of Riak CS 1.4, several members of the Basho team have been approached with the question “Why did you build Riak CS?”
When we open sourced Riak CS in March of 2013, the conversation focused on the importance of the community of developers with whom we engage, and participating with this community in a more open fashion.
However, understanding the history of a product can be just as important as understanding the logic behind our go-to-market strategy.
Put simply, Basho is a distributed systems company.
As a company that started with Riak, an open source distributed database, we had an immediate, targeted focus on high availability, fault-tolerance, and linear scalability. These core properties of our database implementation are, in actuality, consistent themes to consider when building any distributed system. And as Riak and Riak Enterprise gained traction in market, several customers began to use their Riak implementation to store larger objects.
With this and other customer feedback in mind, we prototyped Riak CS, which offers all of the benefits of Riak, while also adding the features and functionality required to power large object storage in public or private clouds as well as providing reliable storage for applications and services.
As we built upon this initial prototype, both based on distributed systems themes and customer input, we added an S3-compatible API to Riak CS. This provided a solution for service providers that wanted to offer S3-compatible storage and for customers that wanted to adopt a hybrid-cloud approach to address data sovereignty or redundancy concerns. We also added OpenStack Object Storage API compatibility with the latest Riak CS 1.4 release. Riak CS can now easily interact with multiple IaaS providers, which helps expand our potential user base for both the open source and enterprise product.
However, regardless of feature decisions – either present or in the future – our commitment to providing robust, resilient distributed storage remains.
May 1, 2013
This post looks at five commonly asked questions about Riak CS – simple, available, open source storage built on top of Riak. For more information, please review our full documentation, or sign up for an intro to Riak CS webcast on Friday, May 10.
What is the relationship between Riak and Riak CS?
Riak CS is built on top of Riak, exposing higher-level storage functions including large object support, an S3-compatible API, multi-tenancy, and per-user storage and access statistics. Riak itself provides the replication, availability, fault-tolerance, and underlying storage functions for the Riak CS implementation. Riak and Riak CS should both be installed on every node in your cluster. While Riak and Riak CS could be run on separate virtual or physical nodes, running them on the same machine minimizes intra-cluster bandwidth usage and is the recommended approach. As with Riak, we advise a minimum 5-node cluster.
When objects are uploaded to Riak CS, the object is broken up into smaller chunks which are then streamed, stored, and replicated in the underlying cluster. A manifest is maintained for each object, that points to which blocks comprise the object, and is used to retrieve all blocks and present them to the client on read. In addition to running Riak and Riak CS on each node, Stanchion, a request serializer, must be installed on at least one node in the cluster. This ensures that global entities, such as users and buckets, are unique in the system.
What use cases does Riak CS support that Riak doesn’t?
Riak CS has several features that are not provided in the standalone Riak database. One of the most obvious differences is in the size of objects supported. Riak CS exposes large object support, and includes multi-part upload so you can upload objects as a series of parts. This allows you to upload single objects to the system into the terabyte range. In Riak, the data model is simply key/value; in Riak CS, the key/value model provides the underlying structure for higher-level storage semantics – users, buckets and objects. The Riak CS interface is an S3-compatible HTTP API, allowing you to use existing S3 libraries and tools. In contrast, Riak exposes an HTTP and protobufs API and offers many language-specific clients. Unlike Riak, Riak CS is multi-tenant, with the concept of “users” and per-user reporting on storage and access. This makes it a fit for both private cloud scenarios, with multiple internal users, or as a foundation for a public cloud storage offering.
How does multi-tenancy, authentication and reporting work?
Riak CS exposes an interface for user creation, disablement and credential management. Riak CS can be set so that only administrators can create new users. Administrators also have special privileges including being able to retrieve a list of all users in the system and query the user account information of any user. Once issued credentials, users are able to authenticate, create buckets, upload and download files, retrieve account information, obtain new credentials, or disable their account through the API. Riak CS supports the standard S3 authentication scheme, with support for header and query string authorization.
Riak CS exposes storage, usage and network statistics that support use cases like accounting, subscription, billing or multi-group utilization for public or private clouds. Riak CS will report information on how much storage a user is consuming and the network operations related to access. This data is exposed via an HTTP interface and can be queried on the default timespan “now” or as a range from start time through end time. Access statistics are reported as bytes in and bytes out for both object and bucket operations. Reporting of this information can be scheduled for a set interval or manually triggered.
What’s the difference between Riak CS and Riak CS Enterprise?
Riak CS Enterprise provides multi-datacenter replication on top of Riak CS. For multi-datacenter replication in Riak CS, global information for users, bucket information and manifests are streamed in real-time from a primary implementation to a secondary site so global state is maintained across locations. Objects can then be replicated in either full sync or real-time sync mode. The secondary site will replicate the object as in normal operations. Additional datacenters can be added in order to create availability zones or provide additional data redundancy and locality. Riak CS Enterprise can also be configured for bi-directional replication. Riak CS Enterprise also comes with 24/7, enterprise-level support. More information and pricing can be found here, and full technical information is available on our docs portal. Ready to get started? Sign up for a developer trial of Riak CS Enterprise.
What are your plans for integration of Riak CS with open source compute solutions?
Riak CS provides highly available, distributed storage, making it a natural fit for usage alongside compute solutions. We have partnered with Citrix to collaborate on the integration of Apache CloudStack and Riak CS to create a complete cloud software offering that combines compute and storage in an integrated platform. For more information on our partnership with CloudStack, check out this blog post with the latest update. API and authentication support for OpenStack is also in progress.
April 9, 2013
Riak CS (Cloud Storage) is simple, open source storage software built on top of Riak. s3cmd is a command-line tool for uploading, retrieving, and managing data via an Amazon S3 compatible API.
In this short screencast, we cover the process of installing and configuring s3cmd on a Debian- or Ubuntu-based system. Once installed, we’ll use Amazon’s s3cmd tool to manage buckets and files in Riak CS. You can view the entire screencast below.
Due to the small type used in the screencast, we recommend viewing this video at a high resolution.
You can also learn more about Riak CS here.
April 3, 2013
As you might have heard, we recently open sourced Riak CS, cloud storage built on Riak. You can find all of the code on our GitHub account and download Riak CS here. To help you get started with Riak CS, here are some common use cases.
- Large Object Storage For Applications and Services: Riak CS is built for storing large objects of all types. It is content agnostic so you can store images, text, video, documents, database backups, software binaries, or other data types. Riak CS can store objects into the terabyte size range using the new multipart upload feature. When an object is uploaded, Riak CS breaks it into smaller blocks that are streamed, stored, replicated in the underlying Riak cluster.
- On-Demand Internal Storage Capacity: Riak CS provides highly available storage for internal business units. Built on Riak, Riak CS has a masterless, redundant design that ensures availability and fault-tolerance. Use cases might include document storage or backing for internal applications.
- Storage Layer for Public Clouds/Cloud Services: Riak CS’ flexibility and scalability provide the ideal foundation for building public clouds or cloud services. Capacity can be added by installing Riak CS on a new physical node and joining it with the cluster. Riak automatically redistributes data and ownership so all nodes have equal responsibility, which prevents storage hot spots and decreases the operational burden of adding new nodes. Additionally, Riak CS is multi-tenant, a requirement of most public cloud services today.
- Amazon S3 Compatibility: Riak CS is S3-compatible, making it easy for your developers to be productive quickly. Riak CS can be used with existing S3 clients and libraries. The HTTP REST API supports service, bucket, and object-level operations to easily store and retrieve data. Riak CS makes sense for companies that are trying to provide internal, S3-like services or using a hybrid approach with some public and some private cloud storage.
- Disaster Recovery and Active Backups: Riak CS Enterprise extends Riak CS with multi-datacenter replication. By replicating data across datacenters using either real-time or full-sync, you can maintain redundant storage in case of disaster scenarios. Multi-datacenter replication can also be used to maintain active backups or create availability zones.
February 11, 2013
We are excited to announce Datapipe’s Stratosphere, a globally available, high-performance managed cloud computing platform, leverages Riak Cloud Storage (CS). Riak Cloud Storage provides Datapipe and its customers with highly available, low-latency and S3-compatible storage.
Datapipe offers a single provider solution for managing and securing mission critical IT services, including cloud computing, infrastructure as a service, platform as a service, managed hosting, and colocation.
Stratosphere is Datapipe’s globally available managed cloud computing platform. With the launch of Riak CS to support cloud object storage, Datapipe customers can now access cloud object storage from any solution hosted with Datapipe and adjacent to existing solutions in any Datapipe data center. Stratosphere is designed for enterprise high I/O production environments and can also be used for development, testing and QA environments. Use cases include large-scale marketing campaigns, brand sites and analytics; applications with variable peak demand times and other dynamic workloads; and cloud disaster recovery and geographic redundancy.
Datapipe delivers services from the world’s most influential technical and financial markets including New York metro, Silicon Valley, London, Hong Kong and Shanghai.
Why Riak Cloud Storage at Datapipe?
Datapipe selected Riak Cloud Storage for its low-latency, highly available object storage, operational ease-of-use, and multi-site replication capabilities. After extensively testing solutions from a variety of vendors in the space, Datapipe selected Riak Cloud Storage for a few core reasons:
- Built on years of developing Riak, Riak CS is designed to provide simple, available, distributed cloud storage at any scale.
- Riak CS is compatible with major cloud object storage clients and applications with its S3-based API.
- Riak CS meets the high performance requirements of the Stratosphere cloud-computing platform.
“Riak CS provides the high-performance, distributed datastore we need to deliver a sound foundation for our cloud storage needs now and for many years into the future,” said Ed Laczynski, VP Cloud Strategy, Datapipe.
Be on the lookout for upcoming documentation about using Riak CS-backed functionality on Stratosphere at Datapipe. Riak CS is now available with Datapipe in a limited beta, with an upcoming full release.
For a developer trial of Riak CS, sign up here.