Many keys in Riak KV return a 404 or 503 error on the first request


I have two datacentres, site1 and site2, with 10 nodes in each. site1: 10.10.1.1 .. 10.10.1.10; site2: 10.10.2.1 .. 10.10.2.10.

The two sites are connected by fibre, and the latency between them is below 1 ms.

site1 /etc/hosts

10.10.1.1  node01.server
10.10.1.2  node02.server
10.10.1.3  node03.server
10.10.1.4  node04.server
10.10.1.5  node05.server
10.10.1.6  node06.server
10.10.1.7  node07.server
10.10.1.8  node08.server
10.10.1.9  node09.server
10.10.1.10 node10.server

site2 /etc/hosts

10.10.2.1  node01.server
10.10.2.2  node02.server
10.10.2.3  node03.server
10.10.2.4  node04.server
10.10.2.5  node05.server
10.10.2.6  node06.server
10.10.2.7  node07.server
10.10.2.8  node08.server
10.10.2.9  node09.server
10.10.2.10 node10.server

My migration process is:

  1. Stop Riak on 10.10.1.10.
  2. Copy the files in leveldb, ring, cluster_meta, etc. to 10.10.2.10.
  3. Change /etc/hosts as shown below (at this point the other 9 Riak nodes at site1 are still running; only 10.10.1.10 is stopped):

10.10.1.1  node01.server
10.10.1.2  node02.server
10.10.1.3  node03.server
10.10.1.4  node04.server
10.10.1.5  node05.server
10.10.1.6  node06.server
10.10.1.7  node07.server
10.10.1.8  node08.server
10.10.1.9  node09.server
10.10.2.10 node10.server

  4. Start Riak on 10.10.2.10 with the same node name.
  5. Wait for the Riak transfers to complete.
  6. Repeat steps 1 to 5 for the other site1 nodes, one at a time.
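
The "wait for transfers" step in the process above could be scripted roughly as follows. This is only a sketch: `wait_for_transfers`, its defaults, and the `RIAK_ADMIN` override are hypothetical names, and it assumes that `riak-admin transfers` prints the phrase "No transfers active" when no handoffs are pending, which should be checked against the actual output of your Riak version.

```shell
# Poll `riak-admin transfers` until no handoffs are pending.
# ASSUMPTION: idle output contains "No transfers active"; verify this
# wording on your Riak version before relying on it.
# RIAK_ADMIN is a hypothetical override for the command name.
wait_for_transfers() {
    local interval="${1:-30}"    # seconds between polls (illustrative default)
    local max_polls="${2:-120}"  # give up after this many polls (illustrative default)
    local i=0
    while [ "$i" -lt "$max_polls" ]; do
        if ${RIAK_ADMIN:-riak-admin} transfers 2>/dev/null | grep -q 'No transfers active'; then
            return 0
        fi
        sleep "$interval"
        i=$((i + 1))
    done
    return 1  # transfers were still pending when polling stopped
}
```

Called as `wait_for_transfers 30 120`, it returns 0 once the cluster reports no active transfers and 1 if it gives up first, so the next node is only touched after a successful return.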

All nodes of my Riak cluster are now in site2, but we have started to encounter not-found errors.

I use curl to fetch a key from the Riak cluster, with a command like this:

curl -w'\n' -i http://node04.server:8098/buckets/spgs_gamelog_40_1968_0/keys/20049355203

The first time, the command above returns a not-found error like this:

HTTP/1.1 404 Object Not Found
Server: MochiWeb/2.20.0 WebMachine/1.11.1 (greased slide to failure)
Date: Wed, 22 May 2024 07:04:02 GMT
Content-Type: text/plain
Content-Length: 10

not found

OR

HTTP/1.1 503 Service Unavailable
Server: MochiWeb/2.20.0 WebMachine/1.11.1 (greased slide to failure)
Date: Wed, 22 May 2024 07:16:04 GMT
Content-Type: text/plain
Content-Length: 25

R-value unsatisfied: 1/2

When I repeat the command, the key is found and its value is returned:

HTTP/1.1 200 OK
X-Riak-Vclock: a85hYGBgzGDKBVI8th99Mm94rOFjWX6EJYMpkTGPlaFwTvR9viwA
x-riak-index-t_int: 1709645937049
Vary: Accept-Encoding
Server: MochiWeb/2.20.0 WebMachine/1.11.1 (greased slide to failure)
Link: </buckets/spgs_gamelog_40_1968_0>; rel="up"
Last-Modified: Tue, 05 Mar 2024 13:38:57 GMT
ETag: "47x7CEP3nrsLeJkswEXcTw"
Date: Wed, 22 May 2024 07:04:30 GMT
Content-Type: application/json
Content-Length: 1032

{"user_id":"240440215","wriak_t":1709645937050}

I have tried adding parameters such as notfound_ok=false&r=3&pr=1, but it does not help!

After checking the console.log file, I discovered warning messages like the one below:

riak_kv_vnode:log_key_amnesia:4493 Inbound clock entry for <<157,70,93,96,209,34,165,36>> in <<"spgs_gamelog_40_1968_0">>/<<"20049355203">> greater than local.Epochs: {In:70316435 Local:0}. Counters: {In:1 Local:0}
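
To get a feel for how many keys are affected, the warnings can be reduced to a list of distinct bucket/key pairs. The sketch below assumes every warning has the same layout as the line above; `amnesia_keys` is a hypothetical helper name.

```shell
# Print each distinct "bucket key" pair mentioned in log_key_amnesia warnings.
# ASSUMPTION: warnings follow the layout of the sample line, i.e.
#   ... in <<"BUCKET">>/<<"KEY">> greater than local ...
amnesia_keys() {
    grep 'log_key_amnesia' "$1" |
        sed -n 's/.* in <<"\([^"]*\)">>\/<<"\([^"]*\)">> .*/\1 \2/p' |
        sort -u
}
```

For example, `amnesia_keys console.log | wc -l` gives the number of distinct keys that triggered the warning on that node.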

Is there any way for a Riak client to make the fetch succeed on the first attempt, instead of only after read repair has run? And if the leveldb partitions are corrupted, what can I do?
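
As a client-side stopgap (not a fix for the underlying cause), the behaviour described above, where the first GET fails and a repeat succeeds, can be wrapped in a small retry helper. The function name `fetch_with_retry`, its default attempt count and delay, and the `CURL` override are all hypothetical; only the curl options `-s`, `-o` and `-w '%{http_code}'` are standard.

```shell
# Fetch a Riak key over HTTP, retrying when a read returns 404 or 503.
# CURL is a hypothetical override so the function can be tried without a cluster.
fetch_with_retry() {
    local url="$1"
    local attempts="${2:-3}"  # illustrative default
    local delay="${3:-1}"     # seconds between attempts, illustrative default
    local body_file status i=1
    body_file="$(mktemp)" || return 2
    while [ "$i" -le "$attempts" ]; do
        # -s: quiet, -o: write body to a file, -w: print only the status code
        status="$(${CURL:-curl} -s -o "$body_file" -w '%{http_code}' "$url")"
        if [ "$status" = "200" ]; then
            cat "$body_file"
            rm -f "$body_file"
            return 0
        fi
        case "$status" in
            404|503) sleep "$delay" ;;  # a repeat may succeed once read repair has run
            *) break ;;                 # any other status: do not retry
        esac
        i=$((i + 1))
    done
    rm -f "$body_file"
    echo "giving up on $url (last status: $status)" >&2
    return 1
}
```

A call with the parameters already tried in the question would look like `fetch_with_retry 'http://node04.server:8098/buckets/spgs_gamelog_40_1968_0/keys/20049355203?notfound_ok=false&r=3&pr=1'`. This only hides the symptom from the application; it does not explain why the first read misses.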

Added on 2024-05-22:

Running find . -name "LOG" -exec grep -l 'Compaction error' {} \; finds no errors in the leveldb folders.

Added on 2024-05-23: Here are some new warning logs. Does this mean my LAN has some latency?

2024-05-23 02:58:04.915 [warning] <0.5795.5043>@riak_kv_put_fsm:join_mbox_replies:1226 soft-limit mailbox check timeout
2024-05-23 02:58:04.915 [warning] <0.5795.5043>@riak_kv_put_fsm:check_mailboxes:1192 Mailbox soft-load poll timout 100
2024-05-23 02:58:04.915 [warning] <0.5795.5043>@riak_kv_put_fsm:add_errors_to_mbox_data:1239 Mailbox for {633697975561446187189878970435575840553939501056,'[email protected]'} did not return in time
2024-05-23 02:58:04.915 [warning] <0.5795.5043>@riak_kv_put_fsm:add_errors_to_mbox_data:1239 Mailbox for {630843480176034267427762398496676850281174007808,'[email protected]'} did not return in time
2024-05-23 02:58:04.915 [warning] <0.5795.5043>@riak_kv_put_fsm:add_errors_to_mbox_data:1239 Mailbox for {627988984790622347665645826557777860008408514560,'[email protected]'} did not return in time

Added on 05-27: Both old keys written before the migration and new keys written after it have this issue!

  • Could the MTU be a factor? The nodes of my original cluster in siteA all have an MTU of 1500 and the nodes of the new cluster in siteB have an MTU of 9000, but the servers running the applications that write and read keys have an MTU of 1500.
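
One way to test the MTU question is to send pings with the Don't Fragment bit set and a payload sized to exactly fill the MTU. The payload is the MTU minus 28 bytes (20 for an IPv4 header without options, 8 for the ICMP header). The sketch below assumes Linux iputils `ping` (for `-M do`); the helper names are hypothetical.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU - 20 (IPv4 header, no options) - 8 (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send one ping with Don't Fragment set (Linux iputils: -M do).
# A non-zero exit status means a packet of that size did not get through.
probe_mtu() {
    local host="$1" mtu="$2"
    ping -M do -c 1 -W 2 -s "$(icmp_payload_for_mtu "$mtu")" "$host"
}
```

Running `probe_mtu node04.server 9000` between two Riak nodes and `probe_mtu node04.server 1500` from an application server shows whether full-size frames cross each path unfragmented.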

Solution

  • From what you said before you edited your question, it seems that, as part of your migration, a single Riak KV 2.9.10 cluster is spanning two different datacentres.

    I assume that you are updating the "/etc/hosts" files on all nodes such that the nodes in both datacentres resolve a given nodename to a single specific node (i.e. you update all 10 nodes at the same time to say that "[email protected]" has a new IP of 1.2.3.4).

    This is generally a bad idea unless the connection between sites is very, very fast (we're talking dark fibre, 1ms latency). Odds are for your case the connection is too slow. The error message you posted showing clock entry latency seems to back this up.

    The two recommended methods of migrating to a new datacentre are:

    • MDC replication from Site A to Site B (this has the advantage of simplicity and no down time, but at the cost of speed).
    • Backup all nodes at the same time at Site A, and restore all nodes at the same time at Site B (this has the advantage of speed, but at the cost of downtime and simplicity).

    Given the limited information, a potential fix is to move all the remaining nodes at the same time. You might also want to run partition repairs afterwards.