Docker healthcheck stops working after a while


I am running Docker on a Raspberry Pi 3 Model B Plus Rev 1.3 with Raspberry Pi OS and all packages up to date.

TL;DR

The healthchecks on a given container work fine for some time (around 30 minutes, sometimes less, sometimes more), but at some point they get "stuck", so the container stays marked as healthy even when it is not. Is there a way to debug the healthchecks and figure out what is happening?

The healthcheck is not configured in the Dockerfile but in the yml file I use to deploy the stack, as follows:

healthcheck:
  test: ["CMD-SHELL", "curl -f -s -o /dev/null https://my.domain.com/icon/none.png || exit 1"]
  start_period: 1m
  interval: 5s
  timeout: 2s
  retries: 3
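
As a rough way to read these settings: Docker marks a container unhealthy only after `retries` consecutive failures, and each check starts `interval` seconds after the previous one completes. Under the simplifying assumption that every failing check runs into its full `timeout`, the detection delay is bounded at roughly retries × (interval + timeout). The helper name below is hypothetical, and the figure is an estimate rather than a guarantee from Docker.

```shell
# Hypothetical helper: rough upper bound, in seconds, on how long a failing
# service can stay "healthy" before Docker flips it to "unhealthy".
# Assumes each failing check uses its full timeout and that the next check
# starts `interval` seconds after the previous one completes.
max_detection_seconds() {
  interval=$1
  timeout=$2
  retries=$3
  echo $(( retries * (interval + timeout) ))
}

# Values from the healthcheck above: interval 5s, timeout 2s, retries 3.
max_detection_seconds 5 2 3
```

With the values above this prints 21, so while the checks are being scheduled, a broken tunnel should be noticed within about 21 seconds.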

When I start the container, I keep checking docker inspect and see the healthchecks running every 5 seconds, as defined. At some point, though, they simply stop, and I have no idea why, as can be seen below:

pi@openhab:~ $ date
Thu Sep 30 01:45:46 UTC 2021

pi@openhab:~ $ docker inspect ebfa93c5e815                                                                                                                                                                                                                                                   
[                                                                                                                                                                                                                                                                                            
    {                                                                                                                                                                                                                                                                                        
        "Id": "ebfa93c5e815592879b6862b33a1a384cc43b60093f8df5c1a8d51ba25a7d0ef",                                                                                                                                                                                                            
        "Created": "2021-09-30T00:36:17.319888926Z",                                                                                                                                                                                                                                         
        "Path": "/entrypoint.sh",                                                                                                                                                                                                                                                            
        "Args": [],                                                                                                                                                                                                                                                                          
        "State": {                                                                                                                                                                                                                                                                           
            "Status": "running",                                                                                                                                                                                                                                                             
            "Running": true,                                                                                                                                                                                                                                                                 
            "Paused": false,                                                                                                                                                                                                                                                                 
            "Restarting": false, 
            "OOMKilled": false,                                                                                                                                                                                                                                                              
            "Dead": false,                                                                                                                                                                                                                                                                   
            "Pid": 3743,                                                                                                                                                                                                                                                                     
            "ExitCode": 0,                                       
            "Error": "",                    
            "StartedAt": "2021-09-30T00:36:24.648900024Z",              
            "FinishedAt": "0001-01-01T00:00:00Z",                                                                                                                                                                                                                                            
            "Health": {                                                                                                                                                                                                                                                                      
                "Status": "healthy",                                                                                                                                                                                             
                "FailingStreak": 0,                                                                                             
                "Log": [                                                                                                     
                    {                                                                                                      
                        "Start": "2021-09-30T01:05:37.394601872Z",
                        "End": "2021-09-30T01:05:38.510395101Z",
                        "ExitCode": 0,  
                        "Output": ""
                    },                                         
                    {                    
                        "Start": "2021-09-30T01:05:43.538165679Z",
                        "End": "2021-09-30T01:05:44.701265903Z",
                        "ExitCode": 0,
                        "Output": ""
                    },               
                    {          
                        "Start": "2021-09-30T01:05:49.731086207Z",
                        "End": "2021-09-30T01:05:50.940299522Z",
                        "ExitCode": 0,
                        "Output": ""                                               
                    },         
                    {              
                        "Start": "2021-09-30T01:05:55.971634397Z",
                        "End": "2021-09-30T01:05:57.222192641Z",
                        "ExitCode": 0,
                        "Output": ""                                                             
                    },                
                    {                  
                        "Start": "2021-09-30T01:06:02.251407253Z",
                        "End": "2021-09-30T01:06:03.402660632Z",
                        "ExitCode": 0,
                        "Output": ""
                    }
                ]
            }
        },

As can be seen, the healthchecks work fine for about 30 minutes after the container is up, and then they simply stop. The current time is about 40 minutes after the last healthcheck.
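
To put a number on that gap, the last "End" timestamp in the log can be compared with the output of date. This is a small sketch that assumes a date command supporting -d (GNU coreutils or BusyBox); the function names are hypothetical.

```shell
# Hypothetical helper: convert a Docker-style UTC timestamp such as
# 2021-09-30T01:06:03.402660632Z to seconds since the epoch.
# Assumes a date that supports -d (GNU coreutils or BusyBox).
to_epoch() {
  # Drop fractional seconds and the trailing Z, and turn the T into a space.
  ts=$(printf '%s' "$1" | sed -e 's/\..*$//' -e 's/Z$//' -e 's/T/ /')
  date -u -d "$ts" +%s
}

# Seconds elapsed between two timestamps (first argument is the earlier one).
seconds_between() {
  echo $(( $(to_epoch "$2") - $(to_epoch "$1") ))
}

# Last healthcheck "End" from the log vs. the time printed by date.
seconds_between "2021-09-30T01:06:03.402660632Z" "2021-09-30T01:45:46Z"
```

This prints 2383, that is, 39 minutes and 43 seconds without a single check, against a configured interval of 5 seconds.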

Versions

$ docker version
Client:
 Version:           18.09.1
 API version:       1.39
 Go version:        go1.11.6
 Git commit:        4c52b90
 Built:             Fri, 13 Sep 2019 10:45:43 +0100
 OS/Arch:           linux/arm
 Experimental:      false

Server:
 Engine:
  Version:          18.09.1
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.11.6
  Git commit:       4c52b90
  Built:            Fri Sep 13 09:45:43 2019
  OS/Arch:          linux/arm
  Experimental:     false

pi@openhab:~ $ docker info
Containers: 41
 Running: 6       
 Paused: 0                 
 Stopped: 35                                    
Images: 51   
Server Version: 18.09.1
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true                                     
Logging Driver: json-file       
Cgroup Driver: cgroupfs   
Plugins:                  
 Volume: local                       
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active       
 NodeID: jze7gn1w7y5fuk9ykv9omvuwh
 Is Manager: true          
 ClusterID: 0zmswkmc5o699wichuas93j83
 Managers: 1                    
 Nodes: 1                     
 Default Address Pool: 10.0.0.0/8      
 SubnetSize: 24                     
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 192.168.2.104
 Manager Addresses:
  192.168.2.104:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 1.0.0~rc6+dfsg1-3
init version: v0.18.0 (expected: fec3683b971d9c3ef73f284f176672c44b448662)
Security Options:
 seccomp
  Profile: default
Kernel Version: 5.10.60-v7+
Operating System: Raspbian GNU/Linux 10 (buster)
OSType: linux
Architecture: armv7l
CPUs: 4
Total Memory: 923.2MiB
Name: openhab
ID: IL4N:6VFR:HOFK:7DL7:KMAS:PCNQ:7KOD:2JOM:R6I2:A5GD:HO7E:4CJQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No oom kill disable support

What I am trying to do

I have an openHAB installation running on the Raspberry Pi, which I want to be able to access remotely. The Pi is connected to a router, which is connected to a modem. I don't have a static IP, and I don't want to maintain a dynamically updated hostname pointing at my IP and then configure port forwarding on the modem and router. Instead, I have a paid server with a static IP, so I simply want to run SSH from the Pi to the remote server with a reverse port forward, so that I can reach openHAB through the remote server. I want this SSH connection to start automatically when the Pi boots and to be restarted if, for whatever reason, I cannot reach some resource remotely (essentially the curl test from the healthcheck). I have created a Docker image with the following Dockerfile:

FROM alpine:3.11
RUN apk add --no-cache \
  curl \
  openssh-client \
  ca-certificates \
  bash

COPY known_hosts /known_hosts
COPY private_key /private_key
RUN chmod 0400 /private_key
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT [ "/entrypoint.sh" ]

And the entrypoint.sh is simply:

#!/bin/bash
ssh -Nn user@my.domain.com -i /private_key -o UserKnownHostsFile=/known_hosts -R 127.0.0.1:17280:openhab:8080

Now, this works great while the healthchecks are running. I can reboot the remote server, and swarm restarts the ssh-client container. I can stop openHAB, and swarm restarts the ssh-client. I can disconnect the Pi from the internet, and swarm restarts the ssh-client. This all works as I expect, until the healthchecks simply stop for no apparent reason and the container stays "healthy" forever. I still have 60% free RAM and 62% free disk space. Does anyone have any idea what might be happening, or any suggestion? I cannot find any logs either.


Solution

  • This issue no longer appears to happen. I upgraded to Raspbian bullseye (which also took Docker from 18.09.1 to 20.10.5), and healthchecks have been running for a week straight without issues.

    pi@openhab:~ $ docker version
    Client:
     Version:           20.10.5+dfsg1
     API version:       1.41
     Go version:        go1.15.9
     Git commit:        55c4c88
     Built:             Sat Dec  4 10:53:03 2021
     OS/Arch:           linux/arm
     Context:           default
     Experimental:      true
    
    Server:
     Engine:
      Version:          20.10.5+dfsg1
      API version:      1.41 (minimum version 1.12)
      Go version:       go1.15.9
      Git commit:       363e9a8
      Built:            Sat Dec  4 10:53:03 2021
      OS/Arch:          linux/arm
      Experimental:     false
     containerd:
      Version:          1.4.13~ds1
      GitCommit:        1.4.13~ds1-1~deb11u1
     runc:
      Version:          1.0.0~rc93+ds1
      GitCommit:        1.0.0~rc93+ds1-5
     docker-init:
      Version:          0.19.0
      GitCommit:
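
For anyone who hits the same symptom on an older engine, the health state can be watched from the host with standard Docker CLI commands. This is a minimal sketch; the wrapper function names are hypothetical, and the container ID in the example is the one from the question.

```shell
# Hypothetical wrappers around standard Docker CLI calls.

# Print only the health section of `docker inspect` as JSON
# (status, failing streak, and the last few check results).
health_json() {
  docker inspect --format '{{json .State.Health}}' "$1"
}

# Stream health_status events for one container. These events are emitted
# when the health status changes, so if the checks are stuck, no transition
# to "unhealthy" shows up even while the probed URL is down.
health_events() {
  docker events --filter "container=$1" --filter 'event=health_status'
}

# Example (requires a running Docker daemon):
#   health_json ebfa93c5e815
#   health_events ebfa93c5e815
```

The docker info output in the question shows "Debug Mode (server): false". Turning on daemon debug logging (the "debug": true setting in /etc/docker/daemon.json) may surface more detail in the daemon log, although I have not confirmed that it logs anything useful for this particular problem.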