Tag Archives: cluster

Cluster Management Improvements in Riak 1.2

September 13, 2012

Last month we released Riak 1.2, with a number of improvements in Riak stats, the protobufs API, the LevelDB backend, and repair/recovery capabilities. Riak 1.2 also features a new strategy for making cluster changes like adding and removing nodes. With the new approach, Riak allows you to stage changes, view their impact on the cluster, and then commit or abort them. The increased visibility lets Riak operators make more informed decisions about when and how to scale up, scale down, and upgrade or replace nodes. Additionally, you can now make multiple changes, like adding a number of nodes, at the same time – critical for large-scale clusters.

Pre 1.2 Cluster Management

In prior versions of Riak, users made changes to the cluster using commands under the “riak-admin” syntax. To add or remove a node, you would simply call “riak-admin join” or “riak-admin leave,” and the Riak cluster would immediately begin to hand off data and ownership as appropriate. While this approach was simple, it raised two issues we’ve tried to address with the new cluster management capabilities:

  • Coordinating cluster changes: Prior to Riak 1.2, there was no way to group changes together. Changes were entered sequentially, and if there was more than one change (e.g. joining multiple nodes to a cluster), the first change would happen in a single transition and the remaining changes (e.g. the rest of the joins) would occur together in a second transition. In the case of multiple joins, in the first transition, data is transferred from the cluster to the new node. Then, in the second transition, some of the data transferred to the first new node is then transferred to the other new nodes, wasting network bandwidth and disk space. This proved particularly problematic for production deployments in which nodes were frequently added or removed.
  • Planning: The pre-1.2 approach to cluster management didn’t give you visibility into how your changes would affect the cluster before you made them. For instance, the only way to know how many transfers a join would take would be to start the join and then run “riak-admin ring-status”. Likewise, you couldn’t know what ownership would look like until after the join.

Staged Clustering

We addressed both of the above issues with a new approach we’re calling ‘Staged Clustering’.

In Riak 1.2, instead of joins, leaves, etc. taking place immediately, they’re first staged. After staging cluster changes, you can view how the changes will affect the cluster, seeing how ring ownership will change and how many transfers between nodes will need to occur to complete the transition. After looking at the plan, you can then add or remove changes staged to be committed, scrap the plan, or execute it as is.

[Image: Staged Clustering high-level process]

The ‘Staged Clustering’ interface is implemented in Riak’s command line tool, riak-admin, under the ‘cluster’ command. Underneath the ‘cluster’ command are subcommands used to stage, view, and commit cluster changes (e.g. to stage the current node joining dev1’s cluster, you’d use ‘riak-admin cluster join dev1’). You can read more about the new syntax in the Riak Wiki. Currently, the new approach to cluster management is not implemented in Riak Control, our open-source management and monitoring GUI, but support is planned for a later release.

Example

Let’s take a look at how the new cluster management strategy would work in a scenario where we wanted to add three nodes to an existing node (dev1) to form a four-node cluster.

1. View the Current Member Status

First, we call ‘riak-admin member_status’ to get a view of the current cluster, the nodes in it and their current ring ownership:

[Image: riak-admin member_status output]

2. Stage Joining New Nodes

Next, we’ll join three nodes (dev2, dev3, dev4) to the cluster by running the cluster join command on each node:

```bash
dev2: riak-admin cluster join dev1
dev3: riak-admin cluster join dev1
dev4: riak-admin cluster join dev1
```

The joins are now staged for commit.

3. View How Staged Changes Will Affect the Cluster

Now we can use the new ‘riak-admin cluster plan’ command to see the impact of the joins on the cluster, viewing changes to ring ownership and transfers that need to occur.

```bash
riak-admin cluster plan
```

[Image: riak-admin cluster plan output]

In the output, we see: the changes staged for commit, the number of resulting cluster transitions (1), how the data will be distributed around the ring after transition (25% on each node), and the number of transfers the transition will take (48 total).

4. Commit Changes

If we want to commit these changes, we use the commit command:

```bash
riak-admin cluster commit
```

These changes start taking place immediately. If we run ‘riak-admin member_status’, we can see the status of the transition. Additionally, we’ve fleshed out the ‘riak-admin transfers’ command to give you much more visibility into active transfers in Riak 1.2.

Other Resources

For more in-depth information on the new cluster management features in Riak 1.2, check out this recorded webinar with Basho engineer Joseph Blomstedt and the updated docs.

Why Your Riak Cluster Should Have At Least Five Nodes

April 26, 2012

Here at Basho we want to make sure that your Riak implementations are set up from the beginning to succeed. While you can use the Riak Fast Track to quickly set up a 3-node dev/test environment, we recommend that all production deployments use a minimum of 5 nodes, ensuring you benefit from the architectural principles that underpin Riak’s availability, fault-tolerance and scaling properties.

TL;DR: Deployments of five nodes or greater will provide a foundation for the best performance and growth as the cluster expands. Since Riak scales linearly with the addition of more nodes, users find improved performance, reliability, and throughput with larger clusters. Smaller deployments can compromise the fault-tolerance of the system: with a “sane” replication requirement for availability (we default to three copies), node failures in smaller clusters mean that replication requirements may not be met. This can result in degraded performance and risk of data loss. Additionally, clusters smaller than five nodes mean that with a sane replication requirement of 3, a high percentage (75-100% of the nodes) will need to respond to each request, putting undue load on the cluster that may degrade performance.

Let’s take a closer look at the scenarios of a three-node and a four-node cluster.

Performance and Fault Tolerance Concerns in a 3-Node Cluster

To ensure that the cluster is always available to respond to read and write requests, Basho recommends a “sane default” for data replication: three copies of the data on three different nodes. The default configuration of Riak requires four nodes at minimum to ensure no single node holds more than one copy of any particular piece of data. (In future versions of Riak we’ll be able to guarantee that each replica is living on a separate physical node. At this point it’s almost at 100%, but we won’t tell you it’s guaranteed until it is.) While it is possible to change the settings to ensure that the three replicas are on distinct nodes in a three-node cluster, you still run into issues of replica placement during a node failure or network partition.

In the event of node failure or a network partition in a three-node cluster, the default requirement for replication remains three but there are only two nodes available to service requests. This will result in degraded performance and carries a risk of data loss.

Performance and Fault Tolerance Concerns in a 4-Node Cluster

With a requirement of three replicas, any one request for a particular piece of data from a 4-node cluster will require a response from 75-100% of the nodes in the cluster, which may result in degraded performance. In the event of node failure or a network partition in a 4-node cluster, you are back to the issues we outline above.

What if I want to change the replication default?

If using a different data replication number is right for your implementation, just make sure to use a cluster of at least N + 2 nodes, where N is the number of replicas, for the reasons outlined above.
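For illustration, here is a hedged sketch of what changing the cluster-wide default might look like in app.config, assuming the riak_core default_bucket_props setting; the exact location can vary by Riak version, so check the docs for yours:

```erlang
%% app.config excerpt (a sketch, assuming the riak_core
%% default_bucket_props setting): raise the default replication
%% factor from 3 to 5. By the N + 2 rule above, an n_val of 5
%% calls for a cluster of at least 7 nodes.
{riak_core, [
    {default_bucket_props, [{n_val, 5}]}
]}
```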

Going With 5 Nodes

As you add nodes to a Riak cluster that starts with 5 nodes, the percentage of the cluster required to service each request goes down. Riak scales linearly and predictably from this point on. When a node is taken out of service or fails, the number of nodes remaining is large enough to protect you from data loss.

So do your development and testing with smaller clusters, but when it comes to production, start with five nodes.

Happy scaling.

Shanley

A Preview Of Cluster Membership In Riak 1.0

September 9, 2011

Being a distributed company, we make a lot of videos at Basho that are intended for internal consumption and used to educate everyone on new features, functionality, etc. Every once in a while someone makes a video that’s so valuable it’s hard not to share it with the greater community. This is one of those.

This screencast is a bit on the long side, but it’s entirely worth it. Basho Software Engineer Joe Blomstedt put it together to educate all of Basho on the new cluster membership code, features, and functionality coming in the Riak 1.0 release (due out at the end of the month). We aim to make Riak as operationally simple as possible to run at scale, and the choices we make and the code we write around cluster membership form the crux of this simplicity.

At the end of this you’ll have a better idea of what Riak’s cluster membership is all about, its major components, how it works in production, new commands present in Riak 1.0, and much, much more.

And, if you want to dig deeper into what Riak and cluster membership are all about, start here:

* Download Riak 1.0 Pre-release 1
* Riak Core on GitHub
* Where To Start With Riak Core
* Join the Riak Mailing List

It should be noted again that this was intended for internal consumption at Basho, so Joe’s tone and language reflect that in a few sections.

Enjoy, and thanks for being a part of Riak.

The Basho Team

Riak Core – The Coordinator

April 19, 2011

This was originally posted on Ryan Zezeski’s working blog Try Try Try.

At the end of my vnode post I asked the question “Where’s the redundancy?” There is none in RTS, thus far. Riak Core isn’t magic, but rather a suite of tools for building distributed, highly available systems. You have to build your own redundancy. In this post I’ll talk about the coordinator and show how to implement one.

What is a Coordinator?

Logically speaking, a coordinator is just what it sounds like: its job is to coordinate incoming requests. It enforces the consistency semantics of N, R and W and performs anti-entropy services like read repair. In simpler terms, it’s responsible for distributing data across the cluster and re-syncing data when it finds conflicts. You could think of vnodes as the things that Get Shit Done (TM) and the coordinators as the other things telling them what to do and overseeing the work. They work in tandem to make sure your request is handled as well as it can be.

To be more concrete, a coordinator is a gen_fsm. Each request is handled in its own Erlang process. A coordinator communicates with the vnode instances to fulfill requests.

To wrap up, a coordinator

  • coordinates requests
  • enforces the consistency requirements
  • performs anti-entropy
  • is an Erlang process that implements the gen_fsm behavior
  • and communicates with the vnode instances to execute the request

Implementing a Coordinator

Unlike the vnode, Riak Core doesn’t define a coordinator behavior. You have to roll your own each time. I used Riak’s get and put coordinators for guidance. You’ll notice they both have a similar structure. I’m going to propose a general structure here that you can use as your guide, but remember that there’s nothing set in stone on how to write a coordinator.
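That said, here’s a minimal sketch of the overall shape we’ll end up with, assuming the state names used in the rest of this post (prepare, execute, waiting) and a state record matching the get coordinator; treat it as a guide, not a fixed behavior:

```erlang
%% A sketch of the proposed coordinator structure, not a fixed behavior.
-module(rts_get_fsm).
-behavior(gen_fsm).

-export([start_link/4]).
-export([init/1, prepare/2, execute/2, waiting/2,
         handle_event/3, handle_sync_event/4, handle_info/3,
         terminate/3, code_change/4]).

-record(state, {req_id, from, client, stat_name,
                preflist, num_r=0, replies=[]}).

start_link(ReqId, From, Client, StatName) ->
    gen_fsm:start_link(?MODULE, [ReqId, From, Client, StatName], []).

%% The three states below are filled in over the course of this post.
init([ReqId, From, Client, StatName]) ->
    {ok, prepare, #state{req_id=ReqId, from=From,
                         client=Client, stat_name=StatName}, 0}.

prepare(timeout, SD) -> {next_state, execute, SD, 0}.
execute(timeout, SD) -> {next_state, waiting, SD}.
waiting(_Reply, SD)  -> {stop, normal, SD}.

%% Boilerplate callbacks required by the gen_fsm behavior.
handle_event(_Event, StateName, SD) -> {next_state, StateName, SD}.
handle_sync_event(_Event, _From, StateName, SD) -> {reply, ok, StateName, SD}.
handle_info(_Info, StateName, SD) -> {next_state, StateName, SD}.
terminate(_Reason, _StateName, _SD) -> ok.
code_change(_OldVsn, StateName, SD, _Extra) -> {ok, StateName, SD}.
```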

Before moving forward it’s worth mentioning that you’ll want to instantiate these coordinators under a simple_one_for_one supervisor. If you’ve never heard of simple_one_for_one before then think of it as a factory for Erlang processes of the same type. An incoming request will at some point call supervisor:start_child/2 to instantiate a new FSM dedicated to handling this specific request.
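A minimal sketch of such a supervisor might look like this; the module name and restart values here are my own, not from RTS:

```erlang
-module(rts_get_fsm_sup).
-behavior(supervisor).

-export([start_link/0, start_get_fsm/1, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Args is appended to the child spec's argument list, so each call
%% spawns a fresh coordinator via rts_get_fsm:start_link(...).
start_get_fsm(Args) ->
    supervisor:start_child(?MODULE, Args).

init([]) ->
    GetFsm = {rts_get_fsm,
              {rts_get_fsm, start_link, []},
              temporary, 5000, worker, [rts_get_fsm]},
    {ok, {{simple_one_for_one, 10, 10}, [GetFsm]}}.
```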

init(Args) -> {ok, InitialState, SD, Timeout}

```erlang
Args = term()
InitialState = atom()
SD = term()
Timeout = integer()
```

This is actually part of the gen_fsm behavior, i.e. it’s a callback you must implement. Its job is to specify the InitialState name and its data (SD). In this case you’ll also want to specify a Timeout value of 0 in order to immediately go to the InitialState, prepare.

A get coordinator for RTS is passed four arguments.

  1. ReqId: A unique id for this request.
  2. From: Who to send the reply to.
  3. Client: The name of the client entity — the entity that is writing log events to RTS.
  4. StatName: The name of the statistic the requester is interested in.

All this data will be passed as a list to init and the only work that needs to be done is to build the initial state record and tell the FSM to proceed to the prepare state.

```erlang
init([ReqId, From, Client, StatName]) ->
    SD = #state{req_id=ReqId,
                from=From,
                client=Client,
                stat_name=StatName},
    {ok, prepare, SD, 0}.
```

The write coordinator for RTS is very similar but has two additional arguments.

  1. Op: The operation to be performed, one of set, append, incr, incrby or sadd.
  2. Val: The value of the operation. For the incr op this is undefined.

Here is the code.

```erlang
init([ReqID, From, Client, StatName, Op, Val]) ->
    SD = #state{req_id=ReqID,
                from=From,
                client=Client,
                stat_name=StatName,
                op=Op,
                val=Val},
    {ok, prepare, SD, 0}.
```

prepare(timeout, SD0) -> {next_state, NextState, SD, Timeout}

```erlang
SD0 = SD = term()
NextState = atom()
Timeout = integer()
```

The job of prepare is to build the preference list. The preference list is the preferred set of vnodes that should participate in this request. Most of the work is actually done by riak_core_util:chash_key/1 and riak_core_apl:get_apl/3. Both the get and write coordinators do the same thing here.

  1. Calculate the index in the ring that this request falls on.
  2. From this index determine the N preferred partitions that should handle the request.

Here is the code.

```erlang
prepare(timeout, SD0=#state{client=Client,
                            stat_name=StatName}) ->
    DocIdx = riak_core_util:chash_key({list_to_binary(Client),
                                       list_to_binary(StatName)}),
    Prelist = riak_core_apl:get_apl(DocIdx, ?N, rts_stat),
    SD = SD0#state{preflist=Prelist},
    {next_state, execute, SD, 0}.
```

The fact that the key is a two-tuple is simply a consequence of the fact that Riak Core was extracted from Riak and some of its key-value semantics crossed over during the extraction. In the future things like this may change.

execute(timeout, SD0) -> {next_state, NextState, SD}

```erlang
SD0 = SD = term()
NextState = atom()
```

The execute state executes the request by sending commands to the vnodes in the preflist and then putting the coordinator into a waiting state. The code to do this in RTS is really simple; call the vnode command passing it the preference list. Under the covers the vnode has been changed to use riak_core_vnode_master:command/4 which will distribute the commands across the Preflist for you. I’ll talk about this later in the post.

Here’s the code for the get coordinator.

```erlang
execute(timeout, SD0=#state{req_id=ReqId,
                            stat_name=StatName,
                            preflist=Prelist}) ->
    rts_stat_vnode:get(Prelist, ReqId, StatName),
    {next_state, waiting, SD0}.
```

The code for the write coordinator is almost identical except it’s parameterized on Op.

```erlang
execute(timeout, SD0=#state{req_id=ReqID,
                            stat_name=StatName,
                            op=Op,
                            val=undefined,
                            preflist=Preflist}) ->
    rts_stat_vnode:Op(Preflist, ReqID, StatName),
    {next_state, waiting, SD0}.
```

waiting(Reply, SD0) -> Result

```erlang
Reply = {ok, ReqID} | {ok, ReqID, Val}
Result = {next_state, NextState, SD}
       | {stop, normal, SD}
NextState = atom()
SD0 = SD = term()
```

This is probably the most interesting state in the coordinator, as its job is to enforce the consistency requirements and possibly perform anti-entropy in the case of a get. The coordinator waits for replies from the various vnode instances it called in execute and stops once its requirements have been met. The typical shape of this function is to pattern match on the Reply, check the state data SD0, and then either continue waiting or stop, depending on the current state data.

The get coordinator waits for replies with the correct ReqId, increments the reply count and adds the Val to the list of Replies. If the quorum R has been met then return the Val to the requester and stop the coordinator. If the vnodes didn’t agree on the value then return all observed values. In this post I am punting on the conflict resolution and anti-entropy part of the coordinator and exposing the inconsistent state to the client application. I’ll implement them in my next post. If the quorum hasn’t been met then continue waiting for more replies.

```erlang
waiting({ok, ReqID, Val}, SD0=#state{from=From, num_r=NumR0, replies=Replies0}) ->
    NumR = NumR0 + 1,
    Replies = [Val|Replies0],
    SD = SD0#state{num_r=NumR, replies=Replies},
    if
        NumR =:= ?R ->
            Reply =
                case lists:any(different(Val), Replies) of
                    true -> Replies;
                    false -> Val
                end,
            From ! {ReqID, ok, Reply},
            {stop, normal, SD};
        true -> {next_state, waiting, SD}
    end.
```
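The code above leans on a small different/1 helper; given how it’s used with lists:any/2, a minimal version would be:

```erlang
%% Returns a predicate that is true for any reply not equal to A.
different(A) -> fun(B) -> A =/= B end.
```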

The write coordinator has things a little easier here because all it cares about is knowing that W vnodes executed its write request.

```erlang
waiting({ok, ReqID}, SD0=#state{from=From, num_w=NumW0}) ->
    NumW = NumW0 + 1,
    SD = SD0#state{num_w=NumW},
    if
        NumW =:= ?W ->
            From ! {ReqID, ok},
            {stop, normal, SD};
        true -> {next_state, waiting, SD}
    end.
```

What About the Entry Coordinator?

Some of you may be wondering why I didn’t write a coordinator for the entry vnode. If you don’t remember, this is responsible for matching an incoming log entry and then executing its trigger function. For example, any incoming log entry from an access log in combined logging format will cause the total_reqs stat to be incremented by one. I only want this action to occur at most once per entry. There is no notion of N. I could write a coordinator that tries to make some guarantees about its execution, but for now I’m OK with possibly dropping data occasionally.

Changes to rts.erl and rts_stat_vnode

Now that we’ve written coordinators to handle requests to RTS we need to refactor the old rts.erl and rts_stat_vnode. The model has changed from rts calling the vnode directly to delegating the work to rts_get_fsm, which will call the various vnodes and collect responses.

```text
Before:

rts:get ----> rts_stat_vnode:get (local)

After:

                                                         /--> stat_vnode@rts1
rts:get ----> rts_get_fsm:get ----> rts_stat_vnode:get --|--> stat_vnode@rts2
                                                         \--> stat_vnode@rts3
```

Instead of performing a synchronous request the rts:get/2 function now calls the get coordinator and then waits for a response.

```erlang
get(Client, StatName) ->
    {ok, ReqID} = rts_get_fsm:get(Client, StatName),
    wait_for_reqid(ReqID, ?TIMEOUT).
```
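For completeness, here’s a sketch of the wait_for_reqid/2 helper assumed above; it simply blocks until the coordinator sends back a message tagged with our request id, or times out:

```erlang
wait_for_reqid(ReqID, Timeout) ->
    receive
        %% Write coordinators reply {ReqID, ok}; the get coordinator
        %% replies {ReqID, ok, Val}.
        {ReqID, ok} -> ok;
        {ReqID, ok, Val} -> {ok, Val}
    after Timeout ->
        {error, timeout}
    end.
```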

The write requests underwent a similar refactoring.

```erlang
do_write(Client, StatName, Op) ->
    {ok, ReqID} = rts_write_fsm:write(Client, StatName, Op),
    wait_for_reqid(ReqID, ?TIMEOUT).

do_write(Client, StatName, Op, Val) ->
    {ok, ReqID} = rts_write_fsm:write(Client, StatName, Op, Val),
    wait_for_reqid(ReqID, ?TIMEOUT).
```

The rts_stat_vnode was refactored to use riak_core_vnode_master:command/4, which takes a Preflist, Msg, Sender and VMaster as arguments.

Preflist: The list of vnodes to send the command to.

Msg: The command to send.

Sender: A value describing who sent the request, in this case the coordinator. This is used by the vnode to correctly address the reply message.

VMaster: The name of the vnode master for the vnode type to send this command to.

```erlang
get(Preflist, ReqID, StatName) ->
    riak_core_vnode_master:command(Preflist,
                                   {get, ReqID, StatName},
                                   {fsm, undefined, self()},
                                   ?MASTER).
```

Coordinators in Action

Talk is cheap, let’s see it in action. Towards the end of the vnode post I made the following statement:

“If you start taking down nodes you’ll find that stats start to disappear.”

One of the main objectives of the coordinator is to fix this problem. Let’s see if it worked.

Build the devrel

```bash
make
make devrel
```

Start the Cluster

```bash
for d in dev/dev*; do $d/bin/rts start; done
for d in dev/dev{2,3}; do $d/bin/rts-admin join rts1@127.0.0.1; done
```

Feed in Some Data

```bash
gunzip -c progski.access.log.gz | head -100 | ./replay --devrel progski
```

Get Some Stats

```text
./dev/dev1/bin/rts attach
(rts1@127.0.0.1)1> rts:get("progski", "total_reqs").
{ok,97}
(rts1@127.0.0.1)2> rts:get("progski", "GET").
{ok,91}
(rts1@127.0.0.1)3> rts:get("progski", "total_sent").
{ok,445972}
(rts1@127.0.0.1)4> rts:get("progski", "HEAD").
{ok,6}
(rts1@127.0.0.1)5> rts:get("progski", "PUT").
{ok,not_found}
(rts1@127.0.0.1)6> rts:get_dbg_preflist("progski", "total_reqs").
[{730750818665451459101842416358141509827966271488,
  'rts3@127.0.0.1'},
 {753586781748746817198774991869333432010090217472,
  'rts1@127.0.0.1'},
 {776422744832042175295707567380525354192214163456,
  'rts2@127.0.0.1'}]
(rts1@127.0.0.1)7> rts:get_dbg_preflist("progski", "GET").
[{274031556999544297163190906134303066185487351808,
  'rts1@127.0.0.1'},
 {296867520082839655260123481645494988367611297792,
  'rts2@127.0.0.1'},
 {319703483166135013357056057156686910549735243776,
  'rts3@127.0.0.1'}]
```

Don’t worry about what I did on lines 6 and 7 yet, I’ll explain in a second.

Kill a Node

```text
(rts1@127.0.0.1)8> os:getpid().
"91461"
Ctrl^D
kill -9 91461
```

Verify it’s Down

```bash
$ ./dev/dev1/bin/rts ping
Node 'rts1@127.0.0.1' not responding to pings.
```

Get Stats on rts2

Your results may not exactly match mine, as they depend on which vnode instances responded first. The coordinator only cares about getting R responses.

```text
./dev/dev2/bin/rts attach
(rts2@127.0.0.1)1> rts:get("progski", "total_reqs").
{ok,97}
(rts2@127.0.0.1)2> rts:get("progski", "GET").
{ok,[not_found,91]}
(rts2@127.0.0.1)3> rts:get("progski", "total_sent").
{ok,445972}
(rts2@127.0.0.1)4> rts:get("progski", "HEAD").
{ok,[not_found,6]}
(rts2@127.0.0.1)5> rts:get("progski", "PUT").
{ok,not_found}
```

Let’s Compare the Before and After Preflist

Notice that some gets on rts2 return a single value as before, whereas others return a list of values. This is because the Preflist calculation now includes fallback vnodes. A fallback vnode is one that is not on its appropriate physical node. Since we killed rts1, its vnode requests must be routed somewhere else. That somewhere else is a fallback vnode. Since the request-reply model between the coordinator and vnode is asynchronous, our reply value will depend on which vnode instances reply first. If the instances with values reply first then you get a single value; otherwise you get a list of values. My next post will improve this behavior slightly to take advantage of the fact that we know there are still two nodes with the data and there should be no reason to return conflicting values.
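For reference, here’s a sketch of the get_dbg_preflist/2 debug helper used in these listings; presumably it just recomputes the preference list the same way the coordinator’s prepare state does:

```erlang
%% Recompute the preference list the coordinator would use for this
%% Client/StatName pair (a sketch of the debug helper).
get_dbg_preflist(Client, StatName) ->
    DocIdx = riak_core_util:chash_key({list_to_binary(Client),
                                       list_to_binary(StatName)}),
    riak_core_apl:get_apl(DocIdx, ?N, rts_stat).
```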

```text
(rts2@127.0.0.1)6> rts:get_dbg_preflist("progski", "total_reqs").
[{730750818665451459101842416358141509827966271488,
  'rts3@127.0.0.1'},
 {776422744832042175295707567380525354192214163456,
  'rts2@127.0.0.1'},
 {753586781748746817198774991869333432010090217472,
  'rts3@127.0.0.1'}]
(rts2@127.0.0.1)7> rts:get_dbg_preflist("progski", "GET").
[{296867520082839655260123481645494988367611297792,
  'rts2@127.0.0.1'},
 {319703483166135013357056057156686910549735243776,
  'rts3@127.0.0.1'},
 {274031556999544297163190906134303066185487351808,
  'rts2@127.0.0.1'}]
```

In both cases either rts2 or rts3 stepped in for the missing rts1. Also, in each case, one of these vnodes is going to return not_found since it’s a fallback. I added another debug function to determine which one.

```text
(rts2@127.0.0.1)8> rts:get_dbg_preflist("progski", "total_reqs", 1).
[{730750818665451459101842416358141509827966271488,
  'rts3@127.0.0.1'},
 97]
(rts2@127.0.0.1)9> rts:get_dbg_preflist("progski", "total_reqs", 2).
[{776422744832042175295707567380525354192214163456,
  'rts2@127.0.0.1'},
 97]
(rts2@127.0.0.1)10> rts:get_dbg_preflist("progski", "total_reqs", 3).
[{753586781748746817198774991869333432010090217472,
  'rts3@127.0.0.1'},
 not_found]
(rts2@127.0.0.1)11> rts:get_dbg_preflist("progski", "GET", 1).
[{296867520082839655260123481645494988367611297792,
  'rts2@127.0.0.1'},
 91]
(rts2@127.0.0.1)12> rts:get_dbg_preflist("progski", "GET", 2).
[{319703483166135013357056057156686910549735243776,
  'rts3@127.0.0.1'},
 91]
(rts2@127.0.0.1)13> rts:get_dbg_preflist("progski", "GET", 3).
[{274031556999544297163190906134303066185487351808,
  'rts2@127.0.0.1'},
 not_found]
```

Notice the fallbacks are at the end of each list. Also notice that since we’re on rts2, total_reqs will almost always return a single value because its fallback is on another node, whereas GET has a local fallback and will be more likely to return first.

Conflict Resolution & Read Repair

In the next post I’ll be making several enhancements to the get coordinator by performing basic conflict resolution and implementing read repair.

Ryan

Follow Up To Riak Operations Webinar

April 15, 2011

Thanks to all who attended Thursday’s webinar on Riak Operations. If you couldn’t make it you can find a screencast of the webinar below. You can also check out the slides directly.

This webinar took users through many of the different operational aspects of running a production Riak cluster. Topics covered include: basic Riak configuration, monitoring, performance, backups, and much more.

Grant

Free Webinar – Riak Operations – April 14 @ 2PM Eastern

April 7, 2011

Riak, like any other database, is not a piece of your infrastructure to be taken lightly. There are a variety of operational concerns that must be taken into account to ensure both reliability and performance of your Riak cluster. Deployments to different types of environments means knowing the nuances of each environment and being able to adapt to its performance characteristics. This webinar will cover many of the best practices in maintaining your Riak cluster in a variety of environments, as well as monitoring your cluster and understanding how to diagnose performance issues.

We invite you to join us for a free webinar on Thursday, April 14 at 2:00PM Eastern Time (UTC-4) to talk about Riak Operations. In this webinar, we’ll discuss:

  • Basic Riak Operations (config files, logs, etc)
  • Monitoring Riak
  • Backups
  • Understanding Riak Performance
  • Riak in Virtualized Environments

We’ll address the above topics as well as many others. The presentation will last 30 to 45 minutes, with time for questions at the end.

Registration is now closed. 

Riak Fast Track Revisited

May 27, 2010

You may remember a few weeks back we posted a blog about a new feature on the Riak Wiki called The Riak Fast Track. To refresh your memory, “The Fast Track is a 30-45 minute interactive tutorial that introduces the basics of using and understanding Riak, from building a three node cluster up through MapReduce.”

This post is intended to offer some insight into what we learned from the launch and what we are aiming to do moving forward to build out the Fast Track and other similar resources.

The Numbers

The Fast Track and accompanying blog post were published on Tuesday, May 5th. After that there was a full week to send in thoughts, comments, and reviews. In that time period:

  • I received 24 responses (my hope was for >15)
  • Of those 24, 10 had never touched Riak before
  • Of those 24, 13 said they were already planning on using Riak in production or, after going through the Fast Track, were now intending to use Riak in production in some capacity

The Reviews

Most of the reviews seemed to follow a loose template: “Hey. Thanks for this! It’s a great tool and I learned a lot. That said, here is where I think you can improve…”

Putting aside the small flaws (grammar, spelling, content flow, etc.), there emerged numerous recurring topics:

  • Siblings, Vector Clocks, Conflict Resolution, Concurrent Updates…More details please. How do they work in Riak and what implications do they have?
  • Source can be a pain. Can we get a tutorial that uses the binaries?
  • Curl is great, but can we get an Erlang/Protocol Buffers/language specific tutorial?
  • I’ve heard about Links in Riak but there is nothing in the Fast Track about it. What gives!?
  • Pictures, Graphics and Diagrams would be awesome. There is all this talk of Rings, Clusters, Nodes, Vnodes, Partitions, Vector Clocks, Consistent Hashing, etc. Some basic diagrams would go a long way in helping me grasp the Riak architecture.
  • Short, concise screencasts are awesome. More, please!
  • The Basic API page is great but it seems a bit…crowded. I know they are all necessary but do we really need all this info about query parameters, headers and the like in the tutorial?

Another observation about the nature of the reviews: they were very long and very detailed. It would appear that a lot of you spent considerable time crafting thoughtful responses and, while I was expecting this to some extent, I was still impressed and pleasantly surprised.

This led me to draw two conclusions:

  1. People were excited by the idea of bettering the Fast Track for future Riak users to come
  2. Swag is a powerful motivator

Now, I’m going to be a naïve Community Manager and let myself believe that the Riak Developer Community maintains a high level of programmer altruism. The swag was just an afterthought, right?

So What Did We Change?

We have been doing the majority of the editing and enhancing on the fly. This process is still ongoing and I don’t doubt that some of you will notice elements still present that you thought needed changing. We’ll get there. I promise.

Here is a partial list of what was revised:

  • The majority of changes were small and incremental, fixing a phrase here, tweaking a sentence there. Many small fixes and tweaks go a long way!
  • The most noticeable alterations are on the MapReduce page, where we did a lot of work to make it flow better and be more interactive. This continues to be improved.
  • The Basic API Operations page got some love in the form of simplification. After reading your comments, we went back and realized that we were probably throwing too much information at you too fast.
  • There are now several graphics relating to the Riak Ring and Consistent Hashing. There will be more.

And, as I said, this is still ongoing.

Thank You!

I’ve added a Thank You page to the end of the Fast Track to serve as a permanent shout-out to those who help revise and refine the Fast Track. (I hope to see this list grow, too.) Future newcomers to Riak will surely benefit from your time, effort, and input.

What is Next?

Since its release, the Fast Track tutorial has become the second most-visited page on the Riak Wiki, second only to wiki.basho.com itself. This tells us here at Basho that there is a need for more tools and tutorials like this. So our intention is to expand it as far as time permits.

In the short term, we plan to add a link-walking page. This was scheduled for the original iteration of the Fast Track but was scrapped because we didn’t have time to assemble all the components. The MapReduce section is going to get more interactive, too.

Another addition will be content and graphics that demonstrate Riak’s fault-tolerance and ability to withstand node outages.

We also want to get more specific with languages. Right now, the Fast Track uses curl over HTTP. This is great, but language-specific tutorials make tremendous sense, and the only thing preventing us from doing this is time. The ultimate vision is to transform the Fast Track into a sort of “choose your own adventure” module, such that if a Ruby dev who prefers Debian shows up at wiki.basho.com without ever having heard of Riak, they can click a few links and arrive at a tutorial that shows them how to spin up three nodes of Riak on Debian and query it through Ripple. Erlang, Ruby, JavaScript and Java are at the top of the list.

But, we have a long way to go before we get there, so stay tuned for continuous enhancements and improvements. And if you’re at all interested in helping develop and expand the Fast Track (say, perhaps, outlining an up-and-running tutorial for Riak+JavaScript), don’t hesitate to shoot an email to mark@basho.com.

Mark

Community Manager