Tags: database, tidb, tikv, tidb-pd

How do I connect existing TiKV nodes to a new cluster of PDs in TiDB?


I had a working TiDB instance running in gcloud, deployed using the tidb-ansible scripts. I wanted to replace the PD nodes with new ones, so I destroyed and replaced those. The PD cluster comes up ok now, but when I try to start the TiKV nodes, I get this error:

2018/02/28 01:42:08.091 node.rs:191: [ERROR] cluster ID mismatch: local_id 6520261967047847245 remote_id 6527407705559138241. you are trying to connect to another cluster, please reconnect to the correct PD

There's a good explanation of the error in the TiDB FAQs (https://pingcap.com/docs/FAQ/):

The cluster ID mismatch message is displayed when starting TiKV.

This is because the cluster ID stored locally in TiKV is different from the cluster ID specified by PD. When a new PD cluster is deployed, PD generates a random cluster ID. TiKV gets the cluster ID from PD and stores it locally when it is initialized. The next time TiKV is started, it checks its local cluster ID against the cluster ID in PD. If the cluster IDs don’t match, the cluster ID mismatch message is displayed and TiKV exits.

If you previously deployed a PD cluster, but then removed the PD data and deployed a new PD cluster, this error occurs because TiKV uses the old data to connect to the new PD cluster.

But there is no explanation of how to fix the problem. Is there a way to destroy the local cluster ID on the TiKV instance so it can hook up with the PD properly?

Will PD be able to coordinate my existing TiKV nodes (with existing data) if I can get them to talk again?
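
For reference, both IDs show up directly in the logs (the log paths below are just placeholders for wherever the deployment writes them):

    # TiKV's error prints both IDs: local_id is what TiKV stored, remote_id is the new PD's.
    grep "cluster ID mismatch" /path/to/tikv.log
    # The new PD cluster logs the ID it generated at startup.
    grep "init cluster id" /path/to/pd.log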


Solution

  • Is there a way to destroy the local cluster ID on the TiKV instance so it can hook up with the PD properly?

    Rather than destroying the ID stored in TiKV, you can change the new PD cluster’s ID back to the one TiKV already has. See the following steps and example.

  • Will PD be able to coordinate my existing TiKV nodes (with existing data) if I can get them to talk again?

    Yes, it will.

    You can use pd-recover to fix this issue:

    • Step 1. Run the new pd-server (the one that generated cluster ID 6527407705559138241).

    • Step 2. Use pd-recover to change its cluster ID to 6520261967047847245, the ID stored on your TiKV nodes:

      ./pd-recover --endpoints "http://the-new-pd-server:port" --cluster-id 6520261967047847245 --alloc-id 100000000
      
    • Step 3. Restart the PD server.

    Note that PD has a monotonic unique ID allocator, and the --alloc-id flag sets where that allocator resumes after recovery. All region IDs and peer IDs are generated by this allocator, so make sure the value you pass in Step 2 is larger than any ID already in use in the cluster; otherwise newly allocated IDs will collide with existing ones and corrupt TiKV.
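
    Putting the steps together for the cluster in the question, a minimal sketch could look like this. The PD host name and client port (2379 is PD's default), the restart mechanism, and the alloc-id value are placeholders; pick an alloc-id comfortably larger than any region or peer ID the old cluster ever allocated.

      # Step 1: the new PD cluster is already running and generated cluster ID 6527407705559138241.
      # Step 2: rewrite its cluster ID to the one the TiKV nodes still store locally
      #         (the local_id from the error), and move the ID allocator past every ID
      #         the old cluster could have handed out.
      ./pd-recover --endpoints "http://the-new-pd-server:2379" \
          --cluster-id 6520261967047847245 \
          --alloc-id 100000000

      # Step 3: restart the PD server(s), then start the TiKV nodes again so they
      #         reconnect with matching cluster IDs (use whatever your tidb-ansible
      #         deployment normally uses to stop and start pd-server and tikv-server).

    The restart order matters: PD must come back up with the recovered cluster ID before the TiKV nodes try to connect again.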


    Example

    neil:bin/ (master) $ ./pd-server &
    [1] 32718
    2018/03/01 10:51:01.343 util.go:59: [info] Welcome to Placement Driver (PD).                                                                                                                                                                                                               
    2018/03/01 10:51:01.343 util.go:60: [info] Release Version: 0.9.0
    2018/03/01 10:51:01.343 util.go:61: [info] Git Commit Hash: 651d0dd52a46b7990d0cd74d33f2f10194d46565
    2018/03/01 10:51:01.343 util.go:62: [info] Git Branch: namespace
    2018/03/01 10:51:01.343 util.go:63: [info] UTC Build Time:  2017-09-13 05:30:13
    2018/03/01 10:51:01.343 metricutil.go:83: [info] disable Prometheus push client
    2018/03/01 10:51:01.344 server.go:87: [info] PD config - Config({FlagSet:0xc420177500 Version:false ClientUrls:http://127.0.0.1:2379 PeerUrls:http://127.0.0.1:2380 AdvertiseClientUrls:http://127.0.0.1:2379 AdvertisePeerUrls:http://127.0.0.1:2380 Name:pd DataDir:default.pd InitialCluster:pd=http://127.0.0.1:2380 InitialClusterState:new Join: LeaderLease:3 Log:{Level: Format:text DisableTimestamp:false File:{Filename: LogRotate:true MaxSize:0 MaxDays:0 MaxBackups:0}} LogFileDeprecated: LogLevelDeprecated: TsoSaveInterval:3s Metric:{PushJob:pd PushAddress: PushInterval:0s} Schedule:{MaxSnapshotCount:3 MaxStoreDownTime:1h0m0s LeaderScheduleLimit:64 RegionScheduleLimit:12 ReplicaScheduleLimit:16} Replication:{MaxReplicas:3 LocationLabels:[]} QuotaBackendBytes:0 AutoCompactionRetention:1 TickInterval:500ms ElectionInterval:3s configFile: WarningMsgs:[] nextRetryDelay:1000000000 disableStrictReconfigCheck:false})
    2018/03/01 10:51:01.346 server.go:114: [info] start embed etcd
    2018/03/01 10:51:01.347 log.go:84: [info] embed: [listening for peers on  http://127.0.0.1:2380]
    2018/03/01 10:51:01.347 log.go:84: [info] embed: [pprof is enabled under /debug/pprof]
    2018/03/01 10:51:01.347 log.go:84: [info] embed: [listening for client requests on  127.0.0.1:2379]
    2018/03/01 10:51:01 systime_mon.go:11: [info] start system time monitor 
    2018/03/01 10:51:01.408 log.go:84: [info] etcdserver: [name = pd]
    2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [data dir = default.pd]
    2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [member dir = default.pd/member]
    2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [heartbeat = 500ms]
    2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [election = 3000ms]
    2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [snapshot count = 100000]
    2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [advertise client URLs = http://127.0.0.1:2379]
    2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [initial advertise peer URLs = http://127.0.0.1:2380]
    2018/03/01 10:51:01.409 log.go:84: [info] etcdserver: [initial cluster = pd=http://127.0.0.1:2380]
    2018/03/01 10:51:01.475 log.go:84: [info] etcdserver: [starting member b71f75320dc06a6c in cluster 1c45a069f3a1d796]
    2018/03/01 10:51:01.475 log.go:84: [info] raft: [b71f75320dc06a6c became follower at term 0]
    2018/03/01 10:51:01.475 log.go:84: [info] raft: [newRaft b71f75320dc06a6c [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]]
    2018/03/01 10:51:01.475 log.go:84: [info] raft: [b71f75320dc06a6c became follower at term 1]
    2018/03/01 10:51:01.587 log.go:80: [warning] auth: [simple token is not cryptographically signed]
    2018/03/01 10:51:01.631 log.go:84: [info] etcdserver: [starting server... [version: 3.2.4, cluster version: to_be_decided]]
    2018/03/01 10:51:01.632 log.go:84: [info] etcdserver/membership: [added member b71f75320dc06a6c [http://127.0.0.1:2380] to cluster 1c45a069f3a1d796]
    2018/03/01 10:51:01.633 server.go:129: [info] create etcd v3 client with endpoints [http://127.0.0.1:2379]
    2018/03/01 10:51:03.476 log.go:84: [info] raft: [b71f75320dc06a6c is starting a new election at term 1]
    2018/03/01 10:51:03.476 log.go:84: [info] raft: [b71f75320dc06a6c became candidate at term 2]
    2018/03/01 10:51:03.476 log.go:84: [info] raft: [b71f75320dc06a6c received MsgVoteResp from b71f75320dc06a6c at term 2]
    2018/03/01 10:51:03.476 log.go:84: [info] raft: [b71f75320dc06a6c became leader at term 2]
    2018/03/01 10:51:03.476 log.go:84: [info] raft: [raft.node: b71f75320dc06a6c elected leader b71f75320dc06a6c at term 2]
    2018/03/01 10:51:03.477 log.go:84: [info] etcdserver: [setting up the initial cluster version to 3.2]
    2018/03/01 10:51:03.477 log.go:84: [info] etcdserver: [published {Name:pd ClientURLs:[http://127.0.0.1:2379]} to cluster 1c45a069f3a1d796]
    2018/03/01 10:51:03.477 log.go:84: [info] embed: [ready to serve client requests]
    2018/03/01 10:51:03.478 log.go:82: [info] embed: [serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!]
    2018/03/01 10:51:03.480 etcdutil.go:125: [warning] check etcd http://127.0.0.1:2379 status, resp: &{cluster_id:2037210783374497686 member_id:13195394291058371180 revision:1 raft_term:2  3.2.4 24576 13195394291058371180 3 2}, err: <nil>, cost: 1.84566554s
    2018/03/01 10:51:03.489 log.go:82: [info] etcdserver/membership: [set the initial cluster version to 3.2]
    2018/03/01 10:51:03.489 log.go:84: [info] etcdserver/api: [enabled capabilities for version 3.2]
    2018/03/01 10:51:03.500 server.go:174: [info] init cluster id 6527803384525484955
    2018/03/01 10:51:03.579 tso.go:104: [info] sync and save timestamp: last 0001-01-01 00:00:00 +0000 UTC save 2018-03-01 10:51:06.578778001 +0800 CST
    2018/03/01 10:51:03.579 leader.go:249: [info] PD cluster leader pd is ready to serve
    
    neil:bin/ (master) $ ./pd-recover --endpoints "http://localhost:2379" --alloc-id 100000000 --cluster-id 66666666666
    recover success! please restart the PD cluster
    neil:bin/ (master) $ kill 32718
    2018/03/01 10:51:35.258 server.go:228: [info] closing server
    2018/03/01 10:51:35.258 leader.go:107: [error] campaign leader err github.com/pingcap/pd/server/leader.go:269: server closed
    2018/03/01 10:51:35.258 leader.go:65: [info] server is closed, return leader loop
    2018/03/01 10:51:35.259 log.go:84: [info] etcdserver: [skipped leadership transfer for single member cluster]
    2018/03/01 10:51:35.259 log.go:84: [info] etcdserver/api/v3rpc: [grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}]
    2018/03/01 10:51:35.259 log.go:84: [info] etcdserver/api/v3rpc: [Failed to dial 127.0.0.1:2379: grpc: the connection is closing; please retry.]
    2018/03/01 10:51:35.291 server.go:246: [info] close server
    2018/03/01 10:51:35.291 main.go:89: [info] Got signal [15] to exit.
    [1]  + 32718 done       ./pd-server
    neil:bin/ (master) $ ./pd-server
    2018/03/01 10:51:40.007 util.go:59: [info] Welcome to Placement Driver (PD).
    2018/03/01 10:51:40.007 util.go:60: [info] Release Version: 0.9.0
    2018/03/01 10:51:40.007 util.go:61: [info] Git Commit Hash: 651d0dd52a46b7990d0cd74d33f2f10194d46565
    2018/03/01 10:51:40.007 util.go:62: [info] Git Branch: namespace
    2018/03/01 10:51:40.007 util.go:63: [info] UTC Build Time:  2017-09-13 05:30:13
    2018/03/01 10:51:40.007 metricutil.go:83: [info] disable Prometheus push client
    2018/03/01 10:51:40.007 server.go:87: [info] PD config - Config({FlagSet:0xc4200771a0 Version:false ClientUrls:http://127.0.0.1:2379 PeerUrls:http://127.0.0.1:2380 AdvertiseClientUrls:http://127.0.0.1:2379 AdvertisePeerUrls:http://127.0.0.1:2380 Name:pd DataDir:default.pd InitialCluster:pd=http://127.0.0.1:2380 InitialClusterState:new Join: LeaderLease:3 Log:{Level: Format:text DisableTimestamp:false File:{Filename: LogRotate:true MaxSize:0 MaxDays:0 MaxBackups:0}} LogFileDeprecated: LogLevelDeprecated: TsoSaveInterval:3s Metric:{PushJob:pd PushAddress: PushInterval:0s} Schedule:{MaxSnapshotCount:3 MaxStoreDownTime:1h0m0s LeaderScheduleLimit:64 RegionScheduleLimit:12 ReplicaScheduleLimit:16} Replication:{MaxReplicas:3 LocationLabels:[]} QuotaBackendBytes:0 AutoCompactionRetention:1 TickInterval:500ms ElectionInterval:3s configFile: WarningMsgs:[] nextRetryDelay:1000000000 disableStrictReconfigCheck:false})
    2018/03/01 10:51:40.010 server.go:114: [info] start embed etcd
    2018/03/01 10:51:40 systime_mon.go:11: [info] start system time monitor 
    2018/03/01 10:51:40.011 log.go:84: [info] embed: [listening for peers on  http://127.0.0.1:2380]
    2018/03/01 10:51:40.011 log.go:84: [info] embed: [pprof is enabled under /debug/pprof]
    2018/03/01 10:51:40.011 log.go:84: [info] embed: [listening for client requests on  127.0.0.1:2379]
    2018/03/01 10:51:40.019 log.go:84: [info] etcdserver: [name = pd]
    2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [data dir = default.pd]
    2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [member dir = default.pd/member]
    2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [heartbeat = 500ms]
    2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [election = 3000ms]
    2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [snapshot count = 100000]
    2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [advertise client URLs = http://127.0.0.1:2379]
    2018/03/01 10:51:40.020 log.go:84: [info] etcdserver: [restarting member b71f75320dc06a6c in cluster 1c45a069f3a1d796 at commit index 20]
    2018/03/01 10:51:40.020 log.go:84: [info] raft: [b71f75320dc06a6c became follower at term 2]
    2018/03/01 10:51:40.020 log.go:84: [info] raft: [newRaft b71f75320dc06a6c [peers: [], term: 2, commit: 20, applied: 0, lastindex: 20, lastterm: 2]]
    2018/03/01 10:51:40.072 log.go:80: [warning] auth: [simple token is not cryptographically signed]
    2018/03/01 10:51:40.113 log.go:84: [info] etcdserver: [starting server... [version: 3.2.4, cluster version: to_be_decided]]
    2018/03/01 10:51:40.115 log.go:84: [info] etcdserver/membership: [added member b71f75320dc06a6c [http://127.0.0.1:2380] to cluster 1c45a069f3a1d796]
    2018/03/01 10:51:40.116 etcdutil.go:62: [error] failed to get raft cluster member(s) from the given urls.
    2018/03/01 10:51:40.116 server.go:129: [info] create etcd v3 client with endpoints [http://127.0.0.1:2379]
    2018/03/01 10:51:40.116 log.go:82: [info] etcdserver/membership: [set the initial cluster version to 3.2]
    2018/03/01 10:51:40.116 log.go:84: [info] etcdserver/api: [enabled capabilities for version 3.2]
    2018/03/01 10:51:41.021 log.go:84: [info] raft: [b71f75320dc06a6c is starting a new election at term 2]
    2018/03/01 10:51:41.021 log.go:84: [info] raft: [b71f75320dc06a6c became candidate at term 3]
    2018/03/01 10:51:41.021 log.go:84: [info] raft: [b71f75320dc06a6c received MsgVoteResp from b71f75320dc06a6c at term 3]
    2018/03/01 10:51:41.021 log.go:84: [info] raft: [b71f75320dc06a6c became leader at term 3]
    2018/03/01 10:51:41.021 log.go:84: [info] raft: [raft.node: b71f75320dc06a6c elected leader b71f75320dc06a6c at term 3]
    2018/03/01 10:51:41.039 log.go:84: [info] etcdserver: [published {Name:pd ClientURLs:[http://127.0.0.1:2379]} to cluster 1c45a069f3a1d796]
    2018/03/01 10:51:41.039 log.go:84: [info] embed: [ready to serve client requests]
    2018/03/01 10:51:41.040 log.go:82: [info] embed: [serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!]
    2018/03/01 10:51:41.066 server.go:174: [info] init cluster id 66666666666
    2018/03/01 10:51:41.250 cache.go:379: [info] load 0 stores cost 465.361µs
    2018/03/01 10:51:41.251 cache.go:385: [info] load 0 regions cost 426.452µs
    2018/03/01 10:51:41.251 coordinator.go:123: [info] coordinator: Start collect cluster information
    2018/03/01 10:51:41.251 coordinator.go:126: [info] coordinator: Cluster information is prepared
    2018/03/01 10:51:41.251 coordinator.go:136: [info] coordinator: Run scheduler
    2018/03/01 10:51:41.252 tso.go:104: [info] sync and save timestamp: last 0001-01-01 00:00:00 +0000 UTC save 2018-03-01 10:51:44.251760951 +0800 CST
    2018/03/01 10:51:41.252 leader.go:249: [info] PD cluster leader pd is ready to serve
    ^C2018/03/01 10:51:56.077 server.go:228: [info] closing server
    2018/03/01 10:51:56.077 coordinator.go:277: [info] balance-hot-region-scheduler stopped: context canceled
    2018/03/01 10:51:56.077 coordinator.go:277: [info] balance-region-scheduler stopped: context canceled
    2018/03/01 10:51:56.077 coordinator.go:277: [info] balance-leader-scheduler stopped: context canceled
    2018/03/01 10:51:56.077 leader.go:107: [error] campaign leader err github.com/pingcap/pd/server/leader.go:269: server closed
    2018/03/01 10:51:56.078 leader.go:65: [info] server is closed, return leader loop
    2018/03/01 10:51:56.078 log.go:84: [info] etcdserver: [skipped leadership transfer for single member cluster]
    2018/03/01 10:51:56.078 log.go:84: [info] etcdserver/api/v3rpc: [grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: getsockopt: connection refused"; Reconnecting to {127.0.0.1:2379 <nil>}]
    2018/03/01 10:51:56.078 log.go:84: [info] etcdserver/api/v3rpc: [Failed to dial 127.0.0.1:2379: grpc: the connection is closing; please retry.]
    2018/03/01 10:51:56.118 server.go:246: [info] close server
    2018/03/01 10:51:56.118 main.go:89: [info] Got signal [2] to exit.
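
    After the restart, the PD log prints the recovered cluster ID ("init cluster id 66666666666" in the example above; it would be 6520261967047847245 for the cluster in the question), and the TiKV nodes should start without the mismatch error. As a rough sanity check (log path, host, and port below are placeholders for your deployment):

      # PD should now report the same cluster ID that the TiKV nodes have stored locally.
      grep "init cluster id" /path/to/pd.log

      # Once the TiKV nodes are started, the existing stores should register with the
      # recovered PD again (queried here via PD's HTTP API).
      curl "http://the-new-pd-server:2379/pd/api/v1/stores"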