Riak-CS cluster broken after only 1 of 3 nodes failed: "The AWS Access Key Id you provided does not exist in our records"


I've created a 3-node Riak-CS cluster in a sandbox, created buckets, and uploaded some files, and they were replicated between nodes (I hope an intelligent algorithm puts the replicas mainly in partitions on physically different nodes). n_val=2; the rest of the replication config is left at the defaults.

Now I am testing the situation where one of the three nodes fails. I just stopped the riak and riak-cs services on one node, and I get this from the remaining nodes:

s3cmd la s3://
ERROR: S3 error: 403 (InvalidAccessKeyId): The AWS Access Key Id you provided does not exist in our records.

The cluster is supposed to remain operational when one node fails, isn't it? I also tried marking the failed node as down to make sure the cluster state had converged, but that didn't help.


Solution

  • If you have set your n_val to 2, then there are only 2 replicas of each key. When you shut down one node, one of the replicas for a significant fraction (around 50%) of your keys becomes unavailable.

    Looking at the source for the get_user_with_pbc function, it first tries the strong_get_user_with_pbc function. The strong options for fetching a user record are {pr,all}, {r,all}, {notfound_ok,false}. PR=all means the get request fails early unless both primary vnodes are available. If one of your replicas is unavailable, that fails as expected with the pr_val_unsatisfied error.

    If the strong option fails, it retries with the weak_get_user_with_pbc function using the weak options {r, quorum}, {pr, one}, {notfound_ok,false}. Quorum means (n_val/2 + 1), rounded down, which is 2 in this case.
    So this still requires one of the primary vnodes to be available, but we must also get a response from a quorum of vnodes, which in this case means both the primary and the fallback. If the node has just failed, the fallback is still empty at the time of the first request, so the get request receives a notfound from the fallback vnode and the user record from the primary. Since the options include notfound_ok=false, the notfound does not count: that is 1 valid response while quorum is 2, so the request fails.

    Subsequent queries may complete successfully since the fallback would be populated by read-repair after the first request.

    I think you will find a great many things in Riak and Riak CS that don't seem to work quite right if you reduce n_val below 3. For instance, if you had kept n_val at 3, a quorum of 3 vnodes is 2, so the weak options could still have returned a valid response with one of the primaries offline and the fallback not yet populated.
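
The quorum arithmetic above can be sketched in a few lines of Python. This is only a simplified model of the behaviour described in this answer, not Riak's actual code: the function names (quorum, get_succeeds, fetch_user) and the parameters primaries_up and fallback_populated are made up for illustration, and the model assumes every reachable primary holds the user record and that notfound_ok=false means a notfound never counts toward R.

```python
# Simplified model of the strong/weak user fetch described above.
# All names here are hypothetical; this is not Riak or Riak CS code.

STRONG = {"pr": "all", "r": "all"}      # models {pr,all}, {r,all}
WEAK = {"pr": "one", "r": "quorum"}     # models {pr,one}, {r,quorum}


def quorum(n_val):
    """Quorum as described above: n_val/2 + 1, rounded down."""
    return n_val // 2 + 1


def resolve(value, n_val):
    """Turn a symbolic R/PR value into a number of vnodes."""
    symbolic = {"all": n_val, "quorum": quorum(n_val), "one": 1}
    return symbolic[value]


def get_succeeds(n_val, primaries_up, fallback_populated, opts):
    """Return True if a get with notfound_ok=false would succeed.

    primaries_up: primary vnodes still reachable (assumed to hold the record).
    fallback_populated: whether fallbacks covering for the missing primaries
    already hold the record (for example after read repair).
    """
    if primaries_up < resolve(opts["pr"], n_val):
        return False  # corresponds to pr_val_unsatisfied
    fallbacks = n_val - primaries_up
    valid = primaries_up + (fallbacks if fallback_populated else 0)
    return valid >= resolve(opts["r"], n_val)


def fetch_user(n_val, primaries_up, fallback_populated):
    """Try the strong options first, then the weak ones."""
    for name, opts in (("strong", STRONG), ("weak", WEAK)):
        if get_succeeds(n_val, primaries_up, fallback_populated, opts):
            return name
    return None


if __name__ == "__main__":
    print(fetch_user(2, 2, False))  # all nodes up
    print(fetch_user(2, 1, False))  # node just failed, fallback empty
    print(fetch_user(2, 1, True))   # fallback populated by read repair
    print(fetch_user(3, 2, False))  # n_val=3, one primary down
```

Under these assumptions, the n_val=2 case with one primary down and an empty fallback fails both option sets, which matches the 403 seen in the question, while the same failure with n_val=3 still satisfies the weak options.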