Tags: google-kubernetes-engine, etcd, apache-apisix

ETCD troubles on GKE


I run a GKE cluster with 3 nodes. Besides several applications, I also deployed the APISIX gateway on the cluster (chart: apisix, repoURL: https://charts.apiseven.com, targetRevision: "0.11.0"), which in turn deploys an etcd cluster (version 3.4.14) with 3 nodes.
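
The chart / repoURL / targetRevision triple corresponds to an Argo CD Application source. For orientation, the deployment is roughly equivalent to a sketch like the one below; the Application name, namespaces and the etcd.replicaCount override are placeholders/assumptions, only the chart coordinates are the ones mentioned above.

# Hypothetical Argo CD Application sketch; names and namespaces are assumptions,
# only chart, repoURL and targetRevision come from the description above.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: apisix              # assumed Application name
  namespace: argocd         # assumed Argo CD namespace
spec:
  project: default
  source:
    chart: apisix
    repoURL: https://charts.apiseven.com
    targetRevision: "0.11.0"
    helm:
      values: |
        etcd:
          replicaCount: 3   # three etcd members, as described above
  destination:
    server: https://kubernetes.default.svc
    namespace: apisix       # assumed target namespace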

Now here is where it gets strange: the etcd cluster comes up fine and everything is OK, but every day at 5:00 am the third member leaves the cluster, while the second node stays fine (see the logs below).

Logs (etcd-0 node)

2022-09-29 04:59:04.652 CEST etcd {"caller":"etcdserver/zap_raft.go:77", "level":"info", "logger":"raft", "msg":"90126cc714381e07 switched to configuration voters=(3177002992052145560 10381479693335928327)", "ts":"2022-09-29T02:59:04.652Z"}
2022-09-29 04:59:04.653 CEST etcd {"caller":"membership/cluster.go:472", "cluster-id":"b0d7015fda1525c8", "level":"info", "local-member-id":"90126cc714381e07", "msg":"removed member", "removed-remote-peer-id":"3ff1b5cd453a87df", "removed-remote-peer-urls":[…], "ts":"2022-09-29T02:59:04.653Z"}
2022-09-29 04:59:04.653 CEST etcd {"caller":"rafthttp/peer.go:330", "level":"info", "msg":"stopping remote peer", "remote-peer-id":"3ff1b5cd453a87df", "ts":"2022-09-29T02:59:04.653Z"}

Logs (etcd-1 node)

04:59:04.655 CEST{caller: rafthttp/stream.go:421, error: EOF, level: warn, local-member-id: 3ff1b5cd453a87df, msg: lost TCP streaming connection with remote peer, remote-peer-id: 90126cc714381e07, stream-reader-type: stream MsgApp v2, ts: 2022-09-29T02:59:04.654Z}
04:59:04.678 CEST{caller: rafthttp/stream.go:421, error: EOF, level: warn, local-member-id: 3ff1b5cd453a87df, msg: lost TCP streaming connection with remote peer, remote-peer-id: 90126cc714381e07, stream-reader-type: stream Message, ts: 2022-09-29T02:59:04.656Z}
04:59:04.678 CEST{caller: etcdserver/zap_raft.go:77, level: info, logger: raft, msg: 3ff1b5cd453a87df switched to configuration voters=(3177002992052145560 10381479693335928327), ts: 2022-09-29T02:59:04.653Z}
04:59:04.678 CEST{caller: membership/cluster.go:472, cluster-id: b0d7015fda1525c8, level: info, local-member-id: 3ff1b5cd453a87df, msg: removed member, removed-remote-peer-id: 3ff1b5cd453a87df, removed-remote-peer-urls: […], ts: 2022-09-29T02:59:04.657Z}
04:59:04.678 CEST{caller: rafthttp/peer_status.go:66, error: failed to dial 90126cc714381e07 on stream MsgApp v2 (the member has been permanently removed from the cluster), level: warn, msg: peer became inactive (message send to peer failed), peer-id: 90126cc714381e07, ts: 2022-09-29T02:59:04.659Z}
04:59:04.678 CEST{caller: etcdserver/server.go:1150, error: the member has been permanently removed from the cluster, level: warn, msg: server error, ts: 2022-09-29T02:59:04.659Z}
04:59:04.678 CEST{caller: etcdserver/server.go:1151, level: warn, msg: data-dir used by this member must be removed, ts: 2022-09-29T02:59:04.659Z}
04:59:04.678 CEST{caller: rafthttp/peer.go:330, level: info, msg: stopping remote peer, rem

Logs (etcd-2 node)

04:59:04.655 CEST{caller: rafthttp/stream.go:421, error: EOF, level: warn, local-member-id: 3ff1b5cd453a87df, msg: lost TCP streaming connection with remote peer, remote-peer-id: 90126cc714381e07, stream-reader-type: stream MsgApp v2, ts: 2022-09-29T02:59:04.654Z}
04:59:04.678 CEST{caller: rafthttp/stream.go:421, error: EOF, level: warn, local-member-id: 3ff1b5cd453a87df, msg: lost TCP streaming connection with remote peer, remote-peer-id: 90126cc714381e07, stream-reader-type: stream Message, ts: 2022-09-29T02:59:04.656Z}
04:59:04.678 CEST{caller: etcdserver/zap_raft.go:77, level: info, logger: raft, msg: 3ff1b5cd453a87df switched to configuration voters=(3177002992052145560 10381479693335928327), ts: 2022-09-29T02:59:04.653Z}
04:59:04.678 CEST{caller: membership/cluster.go:472, cluster-id: b0d7015fda1525c8, level: info, local-member-id: 3ff1b5cd453a87df, msg: removed member, removed-remote-peer-id: 3ff1b5cd453a87df, removed-remote-peer-urls: […], ts: 2022-09-29T02:59:04.657Z}
04:59:04.678 CEST{caller: rafthttp/peer_status.go:66, error: failed to dial 90126cc714381e07 on stream MsgApp v2 (the member has been permanently removed from the cluster), level: warn, msg: peer became inactive (message send to peer failed), peer-id: 90126cc714381e07, ts: 2022-09-29T02:59:04.659Z}
04:59:04.678 CEST{caller: etcdserver/server.go:1150, error: the member has been permanently removed from the cluster, level: warn, msg: server error, ts: 2022-09-29T02:59:04.659Z}
04:59:04.678 CEST{caller: etcdserver/server.go:1151, level: warn, msg: data-dir used by this member must be removed, ts: 2022-09-29T02:59:04.659Z}
04:59:04.678 CEST{caller: rafthttp/peer.go:330, level: info, msg: stopping remote peer, remote-peer-id: 2c16fb63879f0d98, ts: 2022-09-29T02:59:04.660Z}

I've observed this behaviour several times now and I have no idea what causes it. To me it looks like a problem with GKE itself, but I don't know how to solve it.

In the past I observed a similar behaviour with a Vault cluster I built on the same GKE cluster and couldn't solve that problem either.


Solution

  • I followed the recommendation posted here (https://github.com/etcd-io/etcd/issues/14542).

    I switched to a newer etcd version and set the removeMemberOnContainerTermination flag to false (see the values sketch below).
    So far the cluster has been running stably.
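
    As a rough sketch, and assuming the apisix chart passes these values through to its bundled Bitnami etcd subchart, the override could look like this (the image tag is left as a placeholder; pick the newer etcd release you want to run):

    etcd:
      # Flag from the Bitnami etcd subchart, as referenced in the issue above:
      # stop removing the member from the cluster when its container terminates.
      removeMemberOnContainerTermination: false
      image:
        tag: "<newer-etcd-tag>"   # placeholder, e.g. a 3.5.x tag of the etcd image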