
Docker swarm TLS Failed to validate pending node


I see this line in the log of my swarm manage container:

time="2016-04-15T02:47:59Z" level=debug msg="Failed to validate pending node: lookup node1 on 10.0.2.3:53: server misbehaving" Addr="node1:2376"

I have set up a GitHub repo to reproduce my problem: https://github.com/casertap/playing-with-swarm-tls. I am running a cluster of 2 machines (built with Vagrant):

$script2 = <<STOP
service docker stop
sed -i 's/DOCKER_OPTS=/DOCKER_OPTS="-H tcp:\\/\\/0.0.0.0:2376 -H unix:\\/\\/\\/var\\/run\\/docker.sock --tlsverify --tlscacert=\\/home\\/vagrant\\/.certs\\/ca.pem --tlscert=\\/home\\/vagrant\\/.certs\\/cert.pem --tlskey=\\/home\\/vagrant\\/.certs\\/key.pem"/' /etc/init/docker.conf
service docker start
STOP

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
    config.vm.box = "ubuntu/trusty64"

    config.vm.define "node1" do |app|
        app.vm.network "private_network", ip: "192.168.33.10"
        app.vm.provision "file", source: "ca.pem", destination: "~/.certs/ca.pem"
        app.vm.provision "file", source: "node1-cert.pem", destination: "~/.certs/cert.pem"
        app.vm.provision "file", source: "node1-priv-key.pem", destination: "~/.certs/key.pem"
        app.vm.provision "file", source: "node1.csr", destination: "~/.certs/node1.csr"
        app.vm.provision "docker"
        app.vm.provision :shell, :inline => $script2
    end
    config.vm.define "swarm" do |app|
        app.vm.network "private_network", ip: "192.168.33.12"
        app.vm.provision "shell", inline: "echo '192.168.33.10 node1' >> /etc/hosts"
        app.vm.provision "shell", inline: "echo '192.168.33.12 swarm' >> /etc/hosts"
        app.vm.provision "docker"
        app.vm.provision "file", source: "ca.pem", destination: "~/.certs/ca.pem"
        app.vm.provision "file", source: "swarm-cert.pem", destination: "~/.certs/cert.pem"
        app.vm.provision "file", source: "swarm-priv-key.pem", destination: "~/.certs/key.pem"
        app.vm.provision "file", source: "swarm.csr", destination: "~/.certs/swarm.csr"
    end
end

As you can see, after provisioning, /etc/init/docker.conf on node1 contains these options (the backslashes in the script above are only sed escaping):

DOCKER_OPTS="-H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --tlsverify --tlscacert=/home/vagrant/.certs/ca.pem --tlscert=/home/vagrant/.certs/cert.pem --tlskey=/home/vagrant/.certs/key.pem"

I run:

vagrant up

Then I connect to the swarm machine, create a token, and join node1 to the cluster:

vagrant ssh swarm
export TOKEN=$(docker run swarm create)
#dd182b8d2bc8c03f417376296558ba29

docker run -d swarm join --advertise node1:2376 token://dd182b8d2bc8c03f417376296558ba29

node1 is defined in /etc/hosts on the swarm machine, as you can see in the provisioning steps of the Vagrantfile.

I start the swarm manager with debug log level (without -d):

docker run -p 3376:3376 -v /home/vagrant/.certs:/certs:ro swarm -l debug manage --tlsverify --tlscacert=/certs/ca.pem --tlscert=/certs/cert.pem --tlskey=/certs/key.pem --host=0.0.0.0:3376 token://dd182b8d2bc8c03f417376296558ba29

The log shows:

time="2016-04-15T02:47:59Z" level=debug msg="Failed to validate pending node: lookup node1 on 10.0.2.3:53: server misbehaving" Addr="node1:2376"

The node1 entry in /etc/hosts is actually:

192.168.33.10 node1

It seems that Docker is trying to look up the node1 alias on the wrong bridge network?

========== more info:

You can check this URL to see whether the discovery service found node1, and it did:

https://discovery.hub.docker.com/v1/clusters/dd182b8d2bc8c03f417376296558ba29

Now if you run the swarm manager with -d and do:

vagrant@vagrant-ubuntu-trusty-64:~$ docker --tlsverify --tlscacert=/home/vagrant/.certs/ca.pem --tlscert=/home/vagrant/.certs/cert.pem --tlskey=/home/vagrant/.certs/key.pem -H swarm:3376 info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: swarm/1.2.0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 1
 (unknown): node1:2376
  └ Status: Pending
  └ Containers: 0
  └ Reserved CPUs: 0 / 0
  └ Reserved Memory: 0 B / 0 B
  └ Labels:
  └ Error: (none)
  └ UpdatedAt: 2016-04-15T03:03:28Z
  └ ServerVersion:
Plugins:
 Volume:
 Network:
Kernel Version: 3.13.0-85-generic
Operating System: linux
Architecture: amd64
CPUs: 0
Total Memory: 0 B
Name: ee85273cbb64
Docker Root Dir:
Debug mode (client): false
Debug mode (server): false
WARNING: No kernel memory limit support

You can see that the node's status is Pending.


Solution

  • Although you define node1 in /etc/hosts on the swarm machine, the container that the Swarm manager runs in does not have node1 in its own /etc/hosts file. By default a container doesn't share the host's file system (see https://docs.docker.com/engine/userguide/containers/dockervolumes/), so the Swarm manager falls back to the DNS resolver to look up node1, and that lookup fails. This is the "lookup node1 on 10.0.2.3:53: server misbehaving" message in your log.

    There are several options to resolve this:

    1. Use a resolvable FQDN so the Swarm manager in the container can resolve the node.
    2. Or provide node1's IP in the swarm join command.
    3. Or pass the host's /etc/hosts file to the Swarm manager container using the -v option (see the link above).
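
Options 2 and 3 could look roughly like the sketch below. It reuses the token, IP address, and certificate paths from the question and only prints the commands (a dry run), so nothing is started until you run a printed line yourself. The --add-host variant is not one of the options listed above; it is an alternative I am assuming fits here, since it adds a single entry to the container's /etc/hosts instead of mounting the whole file. Option 1 is left out because it depends on a DNS name that the question does not provide. One caveat for option 2: with --tlsverify the manager checks node1's certificate against the address it dials, so advertising an IP may only work if that certificate covers the IP, which the question does not show.

```shell
#!/bin/sh
# Dry run: the docker commands are only assembled and printed, not executed.
# Values are taken from the question; adjust them for your own setup.
TOKEN=dd182b8d2bc8c03f417376296558ba29
NODE1_IP=192.168.33.10
CERTS=/home/vagrant/.certs

# Option 2: advertise node1 by IP so the manager needs no name lookup.
join_cmd="docker run -d swarm join --advertise ${NODE1_IP}:2376 token://${TOKEN}"

# Arguments shared by both manager variants (same as in the question).
manage_args="swarm -l debug manage --tlsverify --tlscacert=/certs/ca.pem --tlscert=/certs/cert.pem --tlskey=/certs/key.pem --host=0.0.0.0:3376 token://${TOKEN}"

# Option 3: bind-mount the host's /etc/hosts read-only into the manager container.
manage_hosts_cmd="docker run -p 3376:3376 -v /etc/hosts:/etc/hosts:ro -v ${CERTS}:/certs:ro ${manage_args}"

# Assumed alternative to option 3: inject only the node1 entry with --add-host.
manage_addhost_cmd="docker run -p 3376:3376 --add-host node1:${NODE1_IP} -v ${CERTS}:/certs:ro ${manage_args}"

printf '%s\n' "$join_cmd" "$manage_hosts_cmd" "$manage_addhost_cmd"
```

To apply one of them, run the corresponding printed line on the swarm machine, then check docker info against swarm:3376 again to see whether node1 has left the Pending state.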