linux, mariadb, cluster-computing, galera

Updating a galera-cluster to 10.3.15 with MDEV-17458


I just updated a mariadb/galera-cluster to db version 10.3.15. It won't work correctly without at least 2 nodes up, but trying to start any node beyond the first runs into strange error messages like:

0 [Warning] WSREP: SST position can't be set in past. Requested: 0, Current:  14422308.
0 [Warning] WSREP: Can't continue.

This bug may be related:

https://jira.mariadb.org/browse/MDEV-17458?attachmentViewMode=list

However, I notice one peculiarity: the requested state is 0, quite possibly because it's lost somewhere along the way, or because I'm experiencing an entirely different issue.

I also know what it should be: the value that it thinks is 'current'. In other words, reality is the exact opposite of what this node thinks is true: the 'current' should be 0, the 'requested' should be 14422308.

In a related issue:

https://jira.mariadb.org/browse/MDEV-19193

someone comments offhandedly about deleting some files in order to start from a pristine state, but it isn't made clear exactly which files to delete, or on which nodes.

I do not mind starting from the data on one node, ignoring everything on the other nodes and copying everything over.

I tried deleting the following file(s) from the offending nodes (I believe the data directory they're mentioning is /var/lib/mysql/ on most Linux systems):

galera.cache
ib_logfile0
ib_logfile1 

This had no effect.
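In shell terms, what I tried amounts to roughly this on each offending node (assuming the default /var/lib/mysql/ data directory and a systemd setup):

# on each broken node, with mysql stopped
systemctl stop mysql
rm -f /var/lib/mysql/galera.cache
rm -f /var/lib/mysql/ib_logfile0 /var/lib/mysql/ib_logfile1
systemctl start mysql    # still fails with the same SST error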

Someone over at the question "Unable to complete SST transfer due to 'WSREP: SST position can't be set in past.' error" suggests changing the SST number on the node that's still OK. But that won't work: I can only start that node if I use the 'galera_new_cluster' script, which resets its SST number to '-1', no matter what it was. If I start it normally, I get an error like this:

[ERROR] WSREP: wsrep::connect(gcomm://<IP1>,<IP2>,<IP3>,...) failed: 7

In other words, there's not enough other nodes online to join the cluster. So in order to change the SST on the primary node, another node needs to be online, but in order to start up the other node, I need to change the SST on the primary? Catch-22, won't work.

It's nice that they fixed the bug, but how do I fix my now broken cluster?

One more question I've asked myself is this: does this 'SST number' of 14422308 originate from the node that's trying to re-join the cluster, or is it retrieved from the cluster? Apparently it is retrieved from the cluster, because even completely reinstalling the secondary node from scratch and trying to re-join the cluster with it does not solve the problem. The exact same error message remains.
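For what it's worth, the position a node believes it has can be inspected locally in the grastate.dat file in the data directory (again assuming /var/lib/mysql/; values here are only illustrative):

cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    <cluster-uuid>
seqno:   -1
safe_to_bootstrap: 0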


Solution

  • Somehow, the cluster appears to have gotten confused as to its own state. The JOINER nodes in each synchronization step think they have a more advanced state than the DONOR nodes.

    The solution to this problem is to trick the cluster: force it to recognize some node as 'more advanced'.

    Suppose we can identify one node that has complete cluster data. Call it the '1st node'. Pick one node to be the 2nd, one to be the 3rd, and so on (these choices can be made at random).

    Then, stop mysql on all nodes. Edit the configuration file for the cluster and change the value for 'wsrep_cluster_address' on each node. It should be the following:

    +------+---------------------------+
    | Node |   wsrep_cluster_address   |
    +------+---------------------------+
    |    1 | gcomm://                  |
    |    2 | gcomm://<IP1>,<IP2>       |
    |    3 | gcomm://<IP1>,<IP2>,<IP3> |
    +------+---------------------------+
    

    (The pattern continues like this for the fourth and any further nodes in the cluster).
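    As an illustration, on node #3 the relevant part of the Galera configuration would then look something like this (the file name and section vary by distribution; /etc/mysql/conf.d/galera.cnf is just an assumed example, and any other settings should stay whatever they already are):

    [galera]
    wsrep_on              = ON
    wsrep_cluster_address = gcomm://<IP1>,<IP2>,<IP3>
    wsrep_node_address    = <IP3>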

    Now remove all cached data from the nodes other than the first. These are the files:

    ib_logfile*
    grastate.dat
    gvwstate.dat
    galera.cache
    

    situated in the data directory of the MySQL installation (for example, /var/lib/mysql/ on Debian systems).
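    On each node other than the first, that clean-up could look roughly like this (paths assumed; make sure mysql is stopped first):

    systemctl stop mysql
    cd /var/lib/mysql/
    rm -f ib_logfile* grastate.dat gvwstate.dat galera.cache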

    Then edit the "grastate.dat" file on node #1. In our example, the most advanced state the cluster has yet seen is 14422308. Thus set it to 14422309 (or: old state + 1). Also set safe_to_bootstrap to 0 on all nodes (so we don't accidentally try to bootstrap and lose our seqno, running into the same bug again).
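    On node #1, the edited grastate.dat would then look something like this (the uuid stays whatever your cluster already has; only seqno and safe_to_bootstrap matter here):

    # GALERA saved state
    version: 2.1
    uuid:    <your-cluster-uuid>
    seqno:   14422309
    safe_to_bootstrap: 0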

    Now start mysql on node #1 (for example, via systemd: systemctl start mysql). Once it's running, do the same on node #2. Wait for all the data to transfer (this may take a while depending on the inter-node connection speed and the size of the database in question), then repeat for node #3 and any further nodes.
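    A rough sketch of that sequence (the SHOW STATUS query is just one convenient way to confirm a node has finished syncing):

    # on node 1
    systemctl start mysql

    # on node 2, then later node 3, ...
    systemctl start mysql
    mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"   # wait until this reports 'Synced'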

    Afterwards, restore the value for wsrep_cluster_address in every configuration to what it should be (which is equal to the value for the last node).
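    In other words, every node's configuration ends up with the full member list, something like:

    wsrep_cluster_address = gcomm://<IP1>,<IP2>,<IP3>

    (The file is only read at startup, so this change takes effect the next time a node restarts.)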