I scaled in a TiDB cluster a few weeks ago to remove a misbehaving TiKV peer.
The peer refused to tombstone even after a full week, so I turned the server itself off, left it for a few days to see whether any issues appeared, and then ran a forced scale-in to remove it from the cluster.
Even though `tiup cluster display {clustername}` no longer shows that server, some of the other TiKV servers keep trying to contact it.
Example log entries:
[2022/10/13 14:14:58.834 +00:00] [ERROR] [raft_client.rs:840] ["connection abort"] [addr=1.2.3.4:20160] [store_id=16025]
[2022/10/13 14:15:01.843 +00:00] [ERROR] [raft_client.rs:567] ["connection aborted"] [addr=1.2.3.4:20160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error=Some(RemoteStopped)] [store_id=16025]
(IP replaced with 1.2.3.4, but the rest is verbatim)
The server in question has been removed from the cluster for about a month now, yet the TiKV nodes still think it's there.
How do I correct this?
The `store_id` might be a clue: I believe there is a Raft store for which the removed server was the leader, but how do I force that store to choose a new leader? The documentation is not clear on this, but I believe the solution has something to do with the PD servers.
Could you first check the store ID in pd-ctl to make sure it is in the Tombstone state? For pd-ctl usage, please refer to https://docs.pingcap.com/tidb/dev/pd-control. You can use pd-ctl to delete the store and, once it becomes Tombstone, use pd-ctl's `store remove-tombstone` to remove it completely.
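As a rough sketch, assuming the PD endpoint is reachable at http://127.0.0.1:2379, the store ID is the 16025 from your logs, and the pd-ctl version tag matches your cluster version (all of these are placeholders to adjust), the check and cleanup could look like this:

```
# Inspect the store; state_name should show Offline or Tombstone
tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 store 16025

# If it is still Up/Offline, ask PD to delete it so it can move towards Tombstone
tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 store delete 16025

# Once state_name is Tombstone, remove the tombstone record from PD's metadata
tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 store remove-tombstone
```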
For every Region in TiKV, if its leader is disconnected, the followers will elect a new leader, so the dead TiKV node won't remain the leader of any Region.
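If you want to confirm that no Region still has a peer registered on that store (again using the store ID from your logs and a placeholder PD URL), pd-ctl can list them:

```
# List every Region that still has a peer on store 16025;
# an empty result means no Raft group still references the removed node
tiup ctl:v6.1.0 pd -u http://127.0.0.1:2379 region store 16025
```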