Context: Host is AWS-EC2 / Ubuntu 14.04.5 with Docker version 17.05.0-ce. Containers are built from publicly available repo image cbhihe/serf-alpine-bash. All containers are located on the same EC2 instance and share the same default bridge network with net-interface "docker0".
I am trying to join nodes serfDC1 (id d4fd90692e18) and serfDC2 (id 6353e7f6134d) by passing commands from the host's shell:
$ docker exec serfDC1 serf agent -node=Node1 -bind=0.0.0.0:7946
==> Starting Serf agent…
==> Starting Serf agent RPC...
==> Serf agent running!
Node name: 'd4fd90692e18'
Bind addr: '0.0.0.0:7946'
RPC addr: '127.0.0.1:7373'
Encrypted: false
Snapshot: false
Profile: lan
==> Log data will now stream in as it occurs:
2017/06/04 00:01:10 [INFO] agent: Serf agent starting
2017/06/04 00:01:10 [INFO] serf: EventMemberJoin: d4fd90692e18 127.0.0.1
2017/06/04 00:01:11 [INFO] agent: Received event: member-join
^C
After discovering that Node1's container IP is 172.17.0.4, I can issue the serf agent -join command to Node2:
$ docker exec serfDC2 serf agent -node=Node2 -join=172.17.0.4
==> Starting Serf agent...
==> Starting Serf agent RPC...
==> Serf agent running!
Node name: '6353e7f6134d'
Bind addr: '0.0.0.0:7946'
RPC addr: '127.0.0.1:7373'
Encrypted: false
Snapshot: false
Profile: lan
==> Joining cluster...(replay: false)
Join completed. Synced with 1 initial agents
==> Log data will now stream in as it occurs:
2017/06/04 00:18:35 [INFO] agent: Serf agent starting
2017/06/04 00:18:35 [INFO] serf: EventMemberJoin: 6353e7f6134d 127.0.0.1
2017/06/04 00:18:35 [INFO] agent: joining: [172.17.0.4] replay: false
2017/06/04 00:18:35 [INFO] serf: EventMemberJoin: d4fd90692e18 127.0.0.1
2017/06/04 00:18:35 [INFO] agent: joined: 1 nodes
2017/06/04 00:18:36 [WARN] memberlist: Got ping for unexpected node 'd4fd90692e18' from=127.0.0.1:7946
2017/06/04 00:18:36 [INFO] agent: Received event: member-join
2017/06/04 00:18:37 [WARN] memberlist: Got ping for unexpected node d4fd90692e18 from=127.0.0.1:34876
2017/06/04 00:18:37 [ERR] memberlist: Failed TCP fallback ping: EOF
2017/06/04 00:18:37 [INFO] memberlist: Suspect d4fd90692e18 has failed, no acks received
2017/06/04 00:18:38 [WARN] memberlist: Got ping for unexpected node 'd4fd90692e18' from=127.0.0.1:7946
2017/06/04 00:18:39 [WARN] memberlist: Got ping for unexpected node d4fd90692e18 from=127.0.0.1:34879
2017/06/04 00:18:39 [ERR] memberlist: Failed TCP fallback ping: EOF
2017/06/04 00:18:40 [INFO] memberlist: Suspect d4fd90692e18 has failed, no acks received
2017/06/04 00:18:41 [WARN] memberlist: Got ping for unexpected node 'd4fd90692e18' from=127.0.0.1:7946
2017/06/04 00:18:42 [WARN] memberlist: Got ping for unexpected node d4fd90692e18 from=127.0.0.1:34881
2017/06/04 00:18:42 [ERR] memberlist: Failed TCP fallback ping: EOF
2017/06/04 00:18:42 [INFO] memberlist: Marking d4fd90692e18 as failed, suspect timeout reached (0 peer confirmations)
2017/06/04 00:18:42 [INFO] serf: EventMemberFailed: d4fd90692e18 127.0.0.1
2017/06/04 00:18:43 [INFO] agent: Received event: member-failed
2017/06/04 00:18:44 [INFO] memberlist: Suspect d4fd90692e18 has failed, no acks received
2017/06/04 00:19:05 [INFO] serf: attempting reconnect to d4fd90692e18 127.0.0.1:7946
^C
This resulted in a failed join, as shown by:
$ docker exec serfDC2 serf members
6353e7f6134d 127.0.0.1:7946 alive
d4fd90692e18 127.0.0.1:7946 failed
$ docker exec serfDC1 serf members
d4fd90692e18 127.0.0.1:7946 alive
6353e7f6134d 127.0.0.1:7946 failed
I have been at this for quite some time now and am at my wit's end as to where I should turn. HashiCorp's and Docker's documentation do not seem to cover this aspect of the initial handshake between two serf agents in different containers.
Could somebody show me where I took a wrong turn? Any answer would be great, really. Thanks.
Serf nodes need to 'announce' themselves with a routable address. In your case they are telling each other: 'hi, I'm localhost:...', so each one tries to answer to localhost. That cannot work, because each container has its own localhost. You can see it in your logs: every member is listed with the address 127.0.0.1.
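A quick way to spot this symptom is to look for loopback addresses in the member list. The following is only a sketch: `loopback_members` is a helper name I made up, and the sample input is the `serf members` output from the question.

```shell
#!/bin/sh
# Hypothetical helper (not part of serf): reads `serf members` output on stdin
# and prints the name of every member that advertises a 127.x.x.x address.
loopback_members() {
    awk '$2 ~ /^127\./ { print $1 }'
}

# Sample input copied from the question's `serf members` output.
sample='6353e7f6134d 127.0.0.1:7946 alive
d4fd90692e18 127.0.0.1:7946 failed'

# Both members advertise a loopback address, so both names are printed.
printf '%s\n' "$sample" | loopback_members
```

Against a live container you would pipe the real output instead, for example `docker exec serfDC2 serf members | loopback_members`. Any name it prints is a node the other containers cannot reach.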
There is an option that makes the agent use the eth0 IP as the address it advertises to the other nodes in the network: -iface. With it you can drop the -bind option, since port 7946 is the default and does not need to be set explicitly.
So, for Node1:
serf agent -node=Node1 -iface=eth0
And for Node2 (172.17.0.2 is Node1's container IP in my setup; use the one from yours):
serf agent -node=Node2 -join=172.17.0.2 -iface=eth0
From docs:
-iface - This flag can be used to provide a binding interface. It can be used instead of -bind if the interface is known but not the address.
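If you would rather not hard-code the address passed to -join, it can be looked up from Docker. A minimal sketch, assuming each container is attached to a single network (the default bridge, as in the question); `container_ip` is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the IP address Docker assigned to a container.
# With more than one attached network the addresses would be concatenated,
# so this assumes a single network.
container_ip() {
    docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$1"
}

# Usage (needs a running Docker daemon and the containers from the question):
#   node1_ip=$(container_ip serfDC1)
#   docker exec serfDC2 serf agent -node=Node2 -iface=eth0 -join="$node1_ip"
```

The block only defines the function, so it is safe to source; the commented usage lines show how it would feed the join address to Node2.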
It's working properly for me:
Node1:
==> Log data will now stream in as it occurs:
2017/06/04 01:56:40 [INFO] agent: Serf agent starting
2017/06/04 01:56:40 [INFO] serf: EventMemberJoin: Node1 172.17.0.2
2017/06/04 01:56:41 [INFO] agent: Received event: member-join
2017/06/04 01:57:02 [INFO] serf: EventMemberJoin: Node2 172.17.0.3
2017/06/04 01:57:03 [INFO] agent: Received event: member-join
Node2:
==> Log data will now stream in as it occurs:
2017/06/04 01:57:02 [INFO] agent: Serf agent starting
2017/06/04 01:57:02 [INFO] serf: EventMemberJoin: Node2 172.17.0.3
2017/06/04 01:57:02 [INFO] agent: joining: [172.17.0.2] replay: false
2017/06/04 01:57:02 [INFO] serf: EventMemberJoin: Node1 172.17.0.2
2017/06/04 01:57:02 [INFO] agent: joined: 1 nodes
2017/06/04 01:57:03 [INFO] agent: Received event: member-join
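To confirm the result without reading the logs, you can check that every member reports "alive". This is a sketch under the assumption that the status is the third column of `serf members`, as in the output shown in the question; `unhealthy_members` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: reads `serf members` output on stdin, prints every
# member whose status is not "alive", and returns non-zero if it found any.
unhealthy_members() {
    awk 'NF >= 3 && $3 != "alive" { print $1; bad = 1 } END { exit bad + 0 }'
}

# Sample input: the failed state from the question (as seen from serfDC1).
sample='d4fd90692e18 127.0.0.1:7946 alive
6353e7f6134d 127.0.0.1:7946 failed'

if printf '%s\n' "$sample" | unhealthy_members; then
    echo "all members alive"
else
    echo "cluster is not healthy"
fi
```

With the sample above it names 6353e7f6134d and reports the cluster as not healthy. On a real node you would run `docker exec serfDC1 serf members | unhealthy_members`.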
Edit:
If each container runs in its own VM (EC2 instance), the Docker networks of the instances are not interconnected, so each node has to advertise the IP of its EC2 instance and the corresponding ports have to be published. Use -advertise:
-advertise - The advertise flag is used to change the address that we advertise to other nodes in the cluster.
Node1:
serf agent -node=Node1 -iface=eth0 -advertise=INSTANCE_IP
Node2:
serf agent -node=Node2 -join=NODE1_INSTANCE_IP -iface=eth0 -advertise=NODE2_INSTANCE_IP
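And remember to publish the serf port in docker run, for both TCP and UDP, because serf gossips over both and -p publishes TCP only unless told otherwise:
docker run -p 7946:7946 -p 7946:7946/udp (...rest of the command...)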
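Serf uses port 7946 over both TCP and UDP, so both protocols have to be published (and allowed in the EC2 security group) for nodes on different instances to reach each other. A small sketch that builds the publish flags; `serf_publish_flags` is a made-up helper, not a Docker or serf command:

```shell
#!/bin/sh
# Hypothetical helper: print docker publish flags covering TCP and UDP
# for the given port (defaults to serf's default port, 7946).
serf_publish_flags() {
    port="${1:-7946}"
    printf '%s\n' "-p $port:$port/tcp -p $port:$port/udp"
}

serf_publish_flags
```

It would be used unquoted so that the shell splits the flags into separate arguments, for example `docker run $(serf_publish_flags) (...rest of the command...)`.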