jenkins ssh port jenkins-plugins jenkins-docker

Jenkins Slave Offline Node Connection timed out / closed - Docker container - Relaunch step - Configuration is picking Old Port

Jenkins version: 1.643.2

Docker Plugin version: 0.16.0

In my Jenkins environment, I have a Jenkins master with 2-5 slave node servers (slave1, slave2, slave3).

Each of these slaves is configured in the Jenkins global configuration using the Docker Plugin.

Everything is working at this minute.

I saw our monitoring system throwing alerts for high SWAP usage on slave3 (for example, IP 11.22.33.44), so I ssh'ed to that machine and ran sudo docker ps, which listed the Docker containers currently running on slave3.

By running ps -eo pmem,pcpu,vsize,pid,cmd | sort -k 1 -nr | head -10 on the target slave's machine (where 4 containers were running), I found that the top memory consumers were the java -jar slave.jar processes running inside each container. So I thought, why not restart the containers and recoup some memory?

The output below shows sudo docker ps before and after the docker restart <container_instance> step. SCROLL right and you'll notice that in the 2nd line, for the container ID ending with ...0a02, the host port (listed under the PORTS heading) on the slave3 machine was 1053, mapped to the container's port 22 for SSH. This means that when you Relaunch a slave's container from the Jenkins Manage Nodes section, Jenkins connects to the HOST IP at 11.22.33.44:1053 and does whatever it needs to bring the slave up. So Jenkins is holding that PORT (1053) somewhere.

CONTAINER ID        IMAGE                                                   COMMAND                  CREATED             STATUS              PORTS                  NAMES
ae3eb02a278d        docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   26 hours ago        Up 26 hours         0.0.0.0:1048->22/tcp   lonely_lalande
d4745b720a02        docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   9 days ago          Up About an hour    0.0.0.0:1053->22/tcp   cocky_yonath
bd9e451265a6        docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   9 days ago          Up About an hour    0.0.0.0:1050->22/tcp   stoic_bell
0e905a6c3851        docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   9 days ago          Up About an hour    0.0.0.0:1051->22/tcp   serene_tesla

sudo docker restart d4745b720a02; echo $?
d4745b720a02
0

CONTAINER ID        IMAGE                                                   COMMAND                  CREATED             STATUS              PORTS                  NAMES
ae3eb02a278d        docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   26 hours ago        Up 26 hours         0.0.0.0:1048->22/tcp   lonely_lalande
d4745b720a02        docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   9 days ago          Up 4 seconds        0.0.0.0:1054->22/tcp   cocky_yonath
bd9e451265a6        docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   9 days ago          Up About an hour    0.0.0.0:1050->22/tcp   stoic_bell
0e905a6c3851        docker.someinstance.coolcompany.com:443/jenkins-slave-stable-image:1.1   "bash -c '/usr/sbin/s"   9 days ago          Up About an hour    0.0.0.0:1051->22/tcp   serene_tesla

After running sudo docker restart <instanceIDofContainer>, I ran free -h and grep -i swap /proc/meminfo. RAM, which earlier was almost fully used with only 230MB free, now had 1GB free. SWAP, which was 1G total with 1G used (I had tried swappiness of both 60 and 10), now had 450MB free. So the alert got resolved. Cool.
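The swap check above can also be scripted, for example to feed a monitoring hook. This is a minimal sketch that parses text in the /proc/meminfo format. The function names parse_meminfo and swap_used_percent are made up for illustration, and the code assumes values are reported in kB, which is how Linux normally prints them.

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of integers (kB), e.g. {'SwapTotal': 1048572}."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name:   12345 kB" style lines with a numeric value.
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def swap_used_percent(fields):
    """Percentage of swap in use; 0.0 when the host has no swap configured."""
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - fields.get("SwapFree", 0)) / total
```

On the slave machine you would pass in the contents of /proc/meminfo, e.g. swap_used_percent(parse_meminfo(open("/proc/meminfo").read())).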

BUT, now as you notice from the sudo docker ps output above, after the restart step, for that container ID ...0a02, I now got a new PORT# 1054!!
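To spot this kind of port change without eyeballing the table, the PORTS column can be parsed. The sketch below pulls the host port mapped to the container's SSH port out of a PORTS value as printed by sudo docker ps. The helper name host_port_for is hypothetical, and the pattern only covers the IPv4 "ip:port->port/proto" form shown in the output above.

```python
import re

# One mapping as shown in the PORTS column, e.g. "0.0.0.0:1054->22/tcp".
_MAPPING = re.compile(
    r"(?P<host_ip>[\d.]+):(?P<host_port>\d+)->(?P<container_port>\d+)/(?P<proto>\w+)"
)


def host_port_for(ports_field, container_port=22, proto="tcp"):
    """Return the host port mapped to container_port, or None if there is no such mapping."""
    for match in _MAPPING.finditer(ports_field):
        same_port = int(match.group("container_port")) == container_port
        if same_port and match.group("proto") == proto:
            return int(match.group("host_port"))
    return None
```

Comparing the value before and after a restart tells you whether the port Jenkins has stored for the node is now stale.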

When I went to Manage Nodes, took this node offline, stopped it, and relaunched it, Jenkins did NOT pick up the NEW PORT (1054). It is still somehow picking the old port 1053 when making the SSH connection to 11.22.33.44 (the Host's IP), and 1053 is the port that used to be mapped to the container's port 22 (ssh).

How can I change this port or configuration in Jenkins for this slave container so that Jenkins will see the new PORT and can successfully relaunch?

PS: Clicking "Configure" on the Node to see its configuration is NOT showing me anything other than the Name field. Usually a regular slave has a lot of fields (labels, root dir, launch method, properties, env variables, tools for the slave environment), but for these Docker containers I'm not seeing anything other than the Name field. Clicking Test Connection in the Jenkins Global configuration (under the Docker Plugin's section) shows it's successfully finding Docker version 1.8.3.

Right now, port 1053 is not reachable (telnet to it fails) because the container's host port became 1054 after the restart step, so the Jenkins relaunch fails at the SSH connection step (the first thing it does when using the SSH launch method).

[07/27/17 17:17:19] [SSH] Opening SSH connection to 11.22.33.44:1053.
Connection timed out
ERROR: Unexpected error in launching a slave. This is probably a bug in Jenkins.
java.lang.IllegalStateException: Connection is not established!
    at com.trilead.ssh2.Connection.getRemainingAuthMethods(Connection.java:1030)
    at com.cloudbees.jenkins.plugins.sshcredentials.impl.TrileadSSHPasswordAuthenticator.canAuthenticate(TrileadSSHPasswordAuthenticator.java:82)
    at com.cloudbees.jenkins.plugins.sshcredentials.SSHAuthenticator.newInstance(SSHAuthenticator.java:207)
    at com.cloudbees.jenkins.plugins.sshcredentials.SSHAuthenticator.newInstance(SSHAuthenticator.java:169)
    at hudson.plugins.sshslaves.SSHLauncher.openConnection(SSHLauncher.java:1212)
    at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:711)
    at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:706)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[07/27/17 17:19:26] Launch failed - cleaning up connection
[07/27/17 17:19:26] [SSH] Connection closed.
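The timeout in the log above can be reproduced outside Jenkins with a plain TCP check, which is roughly what the telnet test does. A minimal sketch follows; port_is_open is a made-up helper name and the 5 second timeout is an arbitrary choice.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

In the situation described here, one would expect the check against 11.22.33.44 to fail for the old port 1053 and succeed for the new port 1054.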

Solution

OK. Zeeesus!

In JENKINS_HOME (of the MASTER server), I searched which config file was holding the OLD port# info for that/those container node(s) which were now showing as OFFLINE.

Changed directory to the nodes folder inside $JENKINS_HOME and found that there is a config.xml file for each node.

For ex: $JENKINS_HOME/nodes/<slave3_node_IP>-d4745b720a02/config.xml

Resolution Steps:

Edited the file with vim to replace the OLD port with the NEW port.
Manage Jenkins > Reload Configuration from Disk.
Manage Nodes > Selected the particular node which was OFFLINE.
Relaunch slave. This time Jenkins picked up the new PORT and started the container slave as expected (the SSH connection to the new port was visible in the log after the configuration change).
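The manual vim edit in the first step could be scripted along these lines. This is only a sketch: it ASSUMES the port is stored in a <port> element inside the node's config.xml, which I have not verified for this Docker Plugin version, so open the file and check before relying on it. The function names are hypothetical. A plain text substitution is used instead of an XML parser so the rest of the file stays byte-for-byte as Jenkins wrote it.

```python
import re
import shutil


def replace_port(config_text, old_port, new_port):
    """Swap <port>old</port> for <port>new</port>; return (new_text, replacement_count)."""
    # ASSUMPTION: the port lives in a <port> element; verify in your config.xml.
    pattern = re.compile(r"(<port>\s*)%d(\s*</port>)" % old_port)
    return pattern.subn(r"\g<1>%d\g<2>" % new_port, config_text)


def update_node_config(path, old_port, new_port):
    """Back up path to path + '.bak', then rewrite the port in place. Returns the count."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    updated, count = replace_port(original, old_port, new_port)
    if count:
        shutil.copy2(path, path + ".bak")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(updated)
    return count
```

The Reload Configuration from Disk and Relaunch steps are still needed afterwards, since Jenkins keeps the old value in memory until it re-reads the file.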

The page at https://my.company.jenkins.instance.com/projectInstance/docker-plugin/server/<slave3_IP>/ shows, in tabular form, all the containers running on a given slave machine. It has a button (last column) to STOP a given slave's container, but none to START or RESTART it.

Having a START or RESTART button there should do what I just did above in some fashion.

Better solution:

What was happening: all 4 long-lived container nodes running on slave3 were competing for the available RAM (11-12GB). Over time, the JVM process in each container (java -jar slave.jar, which the Relaunch step starts inside the target container on the slave3 server) tried to take as much RAM as it could. That led to low FREE memory, which pushed the machine into SWAP, and SWAP usage grew to the point where the monitoring tool started screaming at us with notifications.

To fix this situation, first thing one should do is:

1) Under the Jenkins Global configuration (Manage Jenkins > Configure System > Docker Plugin section), for that slave server's Image / Docker Template, under the Advanced Settings section, we can put JVM options that tell the container NOT to compete for all the RAM. Putting the following JVM options helped. These JVM settings try to keep the heap space of each container in a smaller box so that it does not starve out the rest of the system.

You can start with 3-4GB, depending on how much total RAM you have on the slave machine where the container-based slave nodes will be running.

2) Look for a more recent version of slave.jar; it may include performance / maintenance enhancements that will help.

3) Integrate your monitoring solution (Icinga etc.) so that it automatically launches a Jenkins job, where the job runs some action (a BASH one-liner, a Python script, Groovy goodness, an Ansible playbook etc.) to fix the issue behind any such alert.

4) Have the container slave nodes relaunched automatically (i.e. the Relaunch step), as that brings a slave back to a rejuvenated state of freshness. All we have to do is look for an idle slave (one that is not running any job), take it offline, bring it online, and then Relaunch it using the Jenkins REST API via a small Groovy script. Put all of this in a Jenkins job and let it run against the long-lived slave nodes.

5) OR one can spin up the container-based slaves on the fly: a use-and-throw model where a fresh container is created each time Jenkins queues a job to run.
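Option 4 above could be sketched roughly as follows. The text suggests a small Groovy script; Python is used here purely for illustration. The endpoint names (toggleOffline, doDisconnect, launchSlaveAgent) and the JSON fields (idle, temporarilyOffline) are how I understand the Jenkins 1.x remote API, so treat them as assumptions and verify them against your instance. The sketch only decides whether a node qualifies and builds the list of requests. Sending them (with authentication, and a CSRF crumb if that is enabled) is left out.

```python
import json
from urllib.parse import quote


def is_relaunch_candidate(computer_json):
    """True if the node's <node_url>/api/json payload says it is idle and not already taken offline."""
    info = json.loads(computer_json)
    return bool(info.get("idle")) and not info.get("temporarilyOffline", False)


def relaunch_plan(base_url, node_name):
    """Ordered (method, url) pairs: mark offline, disconnect, mark online, relaunch."""
    node_url = "%s/computer/%s" % (base_url.rstrip("/"), quote(node_name, safe=""))
    return [
        ("POST", node_url + "/toggleOffline"),     # take the node temporarily offline
        ("POST", node_url + "/doDisconnect"),      # drop the current agent connection
        ("POST", node_url + "/toggleOffline"),     # bring the node back online
        ("POST", node_url + "/launchSlaveAgent"),  # the Relaunch step
    ]
```

A scheduled Jenkins job could fetch each node's JSON, filter with is_relaunch_candidate, and then issue the requests from relaunch_plan in order.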