Search code examples
networkingconsul

Consul server fails to rejoin the cluster, leaves after few seconds


a short overview of my case:

Originally I had a simple running consul cluster with one server (server1) and several clients. Here is my server1.json configuration file:

{
    "server": true,
    "datacenter": "dc1",
    "data_dir": "/opt/consul",
    "bind_addr": "0.0.0.0",
    "client_addr":"0.0.0.0",
    "bootstrap_expect": 3,
    "ui": true,
    "retry_join": ["provider=aws tag_key=Function tag_value=consul-server"],
    "encrypt":"WT2T9..."
}

NOTE: bootstrap_expect was originally = 1, and I changed to 3 when I added new servers.

For some testing purposes I want to add two more servers to the cluster.

So I configure and add the servers one by one. Everything works as expected and I can confirm that my cluster now has 3 servers with two new servers (server1 and server2) being followers.

Here is my server2.json configuration file:

The config file for server3 looks the same.

Now when my cluster is running, for testing purposes I stoped the service of the server1, which was the leader. As a result, new leader was elected.

consul members command shows:

Node            Address           Status   Type   Build  Protocol  DC   Segment
server2  xx.xx.x.xx1:8301  alive   server  1.9.3  2         dc1  <all>
server1  xx.xx.x.xx2:8301  left    server  1.8.3  2         dc1  <all>
server3  xx.xx.x.xx3:8301  alive   server  1.9.3  2         dc1  <all>

Looks good! Server1 has left.

Now I want to bring server1 back to my cluster.

So I restart the consul.service of the server1 and I expect that it will rejoin the cluster.

And it joins it, but in several seconds it is failed and leaves the cluster.

Here are some outputs:

The output of onsul members:

 consul.service - Consul server agent
   Loaded: loaded (/etc/systemd/system/consul.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2021-02-04 13:33:25 UTC; 3s ago
     Docs: https://www.consul.io/
  Process: 18632 ExecStart=/usr/local/bin/consul agent -ui -config-dir=/etc/consul.d/ (code=exited, status=2)
 Main PID: 18632 (code=exited, status=2)

Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc0007ec400, 0x40, 0x40)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/[email protected]/fsm.go:113 +0x75
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM(0xc000302900)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/[email protected]/fsm.go:219 +0x42f
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc000302900, 0xc00085e340)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/[email protected]/state.go:146 +0x55
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: created by github.com/hashicorp/raft.(*raftState).goFunc
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/[email protected]/state.go:144 +0x66
Feb 04 13:33:25 ip-xx-xx-x-xxx systemd[1]: consul.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 04 13:33:25 ip-xx-xx-x-xxx systemd[1]: consul.service: Failed with result 'exit-code'.

The output of journalctl -xe -u consul shows me the following

Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.267Z [INFO]  agent: discover-aws: Instance i-0ae416a280967e345 has private ip 10.10.1.555: cluster=LAN
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.268Z [INFO]  agent: discover-aws: Instance i-0e5394a632853e200 has private ip 10.10.1.777: cluster=LAN
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.268Z [INFO]  agent: discover-aws: Instance i-019966401267318b4 has private ip 10.10.1.222: cluster=LAN
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.268Z [INFO]  agent: Discovered servers: cluster=LAN cluster=LAN servers="10.10.1.555 10.10.1.777 10.10.1.222"
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.268Z [INFO]  agent: (LAN) joining: lan_addresses=[10.10.1.555, 10.10.1.777, 10.10.1.222]
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.271Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: ip-10-10-1-777 10.10.1.777
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.271Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: ip-10-10-1-222 10.10.1.222
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.272Z [INFO]  agent.server: Adding LAN server: server="ip-10-10-1-777 (Addr: tcp/10.10.1.777:8300) (DC: dc1)"
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.272Z [INFO]  agent.server: Adding LAN server: server="ip-10-10-1-222 (Addr: tcp/10.10.1.41:8300) (DC: dc1)"
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.274Z [INFO]  agent: (LAN) joined: number_of_nodes=3
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.275Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.278Z [WARN]  agent.server.raft: failed to get previous log: previous-index=93400 last-index=93262 error="log not found"
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.505Z [INFO]  agent: Synced node info
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:     2021-02-04T13:33:25.526Z [INFO]  agent: Deregistered check: check=vault:10.10.1.555:8200:vault-sealed-check
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: panic: failed to decode request: invalid config entry kind: service-intentions
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: goroutine 37 [running]:
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/consul/agent/consul/fsm.(*FSM).applyConfigEntryOperation(0xc000923c80, 0xc0008b0421, 0x150, 0x150, 0x169ef, 0x0, 0x0)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /home/circleci/project/consul/agent/consul/fsm/commands_oss.go:453 +0xb27
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/consul/agent/consul/fsm.New.func1(0xc0008b0421, 0x150, 0x150, 0x169ef, 0xc000538a50, 0xc000eb9380)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /home/circleci/project/consul/agent/consul/fsm/fsm.go:85 +0x56
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/consul/agent/consul/fsm.(*FSM).Apply(0xc000923c80, 0xc0013702d0, 0x0, 0x0)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /home/circleci/project/consul/agent/consul/fsm/fsm.go:120 +0x24b
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/go-raftchunking.(*ChunkingFSM).Apply(0xc000984780, 0xc0013702d0, 0x53a3f80, 0xbfff1bc9745fc84c)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/[email protected]/fsm.go:66 +0x5b
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc000956fe0)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/[email protected]/fsm.go:90 +0x2c1
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM.func2(0xc0007ec400, 0x40, 0x40)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/[email protected]/fsm.go:113 +0x75
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*Raft).runFSM(0xc000302900)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/[email protected]/fsm.go:219 +0x42f
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc000302900, 0xc00085e340)
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/[email protected]/state.go:146 +0x55
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]: created by github.com/hashicorp/raft.(*raftState).goFunc
Feb 04 13:33:25 ip-xx-xx-x-xxx consul[18632]:         /go/pkg/mod/github.com/hashicorp/[email protected]/state.go:144 +0x66
Feb 04 13:33:25 ip-xx-xx-x-xxx systemd[1]: consul.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 04 13:33:25 ip-xx-xx-x-xxx systemd[1]: consul.service: Failed with result 'exit-code'.

As you can see from the second output, the server1 is actually joining the cluster, but then smth happens and it leaves it.

panic: failed to decode request: invalid config entry kind: service-intentions

I tried to modify configuration file in different ways and restart the consul server1 --> without success.

Finally I emptied the consul data_dir and restarted the server1 anew --> without success.

Has anybody experienced something like this and can support here?


Solution

  • If you look few lines above the provided "panic" line, you can see:

    [WARN] agent.server.raft: failed to get previous log: previous-index=93400 last-index=93262 error="log not found".

    It seems like there is a problem with logs replication. The current leader is not able to replicate the logs to the new follower.

    While I guess there can be multiple reasons for this, one possible problem can be in the version of consul. As far as I see from your consul members command, the version of server1 is older than version of your leader (whether it is server2 or server3).

    Node            Address           Status   Type   Build  Protocol  DC   Segment
    server2  xx.xx.x.xx1:8301  alive   server  1.9.3  2         dc1  <all>
    server1  xx.xx.x.xx2:8301  left    server  1.8.3  2         dc1  <all>
    server3  xx.xx.x.xx3:8301  alive   server  1.9.3  2         dc1  <all>
    

    Try to delete the consul binary from the problematic node (in your case server1) and replace it with the new version. After that restart consul.