docker compose: hundreds of health check processes not terminated


services:
  tomcat:
    ...
    healthcheck:
      test:
        - CMD-SHELL
        - curl --fail http://localhost:8080 || exit 1
      interval: 5s
      timeout: 5s
      retries: 30

ps aux | grep curl

tomcat    939765  0.0  0.0      0     0 ?        Z    08:46   0:00 [curl] <defunct>
tomcat    939824  0.0  0.0      0     0 ?        Z    08:46   0:00 [curl] <defunct>
tomcat    939904  0.0  0.0      0     0 ?        Z    08:47   0:00 [curl] <defunct>
tomcat    939962  0.0  0.0      0     0 ?        Z    08:47   0:00 [curl] <defunct>
tomcat    940038  0.0  0.0      0     0 ?        Z    08:47   0:00 [curl] <defunct>
tomcat    940094  0.0  0.0      0     0 ?        Z    08:47   0:00 [curl] <defunct>
tomcat    940321  0.0  0.0      0     0 ?        Z    08:48   0:00 [curl] <defunct>
tomcat    940380  0.0  0.0      0     0 ?        Z    08:48   0:00 [curl] <defunct>
tomcat    940460  0.0  0.0      0     0 ?        Z    08:48   0:00 [curl] <defunct>
tomcat    940516  0.0  0.0      0     0 ?        Z    08:48   0:00 [curl] <defunct>
tomcat    940600  0.0  0.0      0     0 ?        Z    08:49   0:00 [curl] <defunct>
tomcat    940657  0.0  0.0      0     0 ?        Z    08:49   0:00 [curl] <defunct>
tomcat    940734  0.0  0.0      0     0 ?        Z    08:49   0:00 [curl] <defunct>
tomcat    940875  0.0  0.0      0     0 ?        Z    08:49   0:00 [curl] <defunct>
tomcat    940955  0.0  0.0      0     0 ?        Z    08:50   0:00 [curl] <defunct>
tomcat    941013  0.0  0.0      0     0 ?        Z    08:50   0:00 [curl] <defunct>
tomcat    941102  0.0  0.0      0     0 ?        Z    08:50   0:00 [curl] <defunct>
tomcat    941162  0.0  0.0      0     0 ?        Z    08:50   0:00 [curl] <defunct>
tomcat    941244  0.0  0.0      0     0 ?        Z    08:51   0:00 [curl] <defunct>
tomcat    941332  0.0  0.0      0     0 ?        Z    08:51   0:00 [curl] <defunct>
tomcat    941392  0.0  0.0      0     0 ?        Z    08:51   0:00 [curl] <defunct>
tomcat    941474  0.0  0.0      0     0 ?        Z    08:51   0:00 [curl] <defunct>
tomcat    941532  0.0  0.0      0     0 ?        Z    08:52   0:00 [curl] <defunct>
tomcat    941609  0.0  0.0      0     0 ?        Z    08:52   0:00 [curl] <defunct>
tomcat    941671  0.0  0.0      0     0 ?        Z    08:52   0:00 [curl] <defunct>
tomcat    941749  0.0  0.0      0     0 ?        Z    08:52   0:00 [curl] <defunct>
tomcat    941810  0.0  0.0      0     0 ?        Z    08:53   0:00 [curl] <defunct>
....

tomcat    941895  0.0  0.2  22364  8436 ?        S    08:53   0:00 curl --fail http://localhost:8080
tomcat    941954  0.0  0.2  22364  8512 ?        S    08:53   0:00 curl --fail http://localhost:8080
tomcat    942032  0.0  0.2  22364  8384 ?        S    08:53   0:00 curl --fail http://localhost:8080
tomcat    942238  0.0  0.2  22364  8528 ?        S    08:54   0:00 curl --fail http://localhost:8080
tomcat    942316  0.0  0.2  22364  8552 ?        S    08:54   0:00 curl --fail http://localhost:8080
tomcat    942377  0.0  0.2  22364  8496 ?        S    08:55   0:00 curl --fail http://localhost:8080
tomcat    942452  0.0  0.2  22364  8360 ?        S    08:55   0:00 curl --fail http://localhost:8080
...

Will the health check keep running periodically even after the container has been reported healthy?

What is the reason that "curl" processes are not terminated?


Solution

  • Looking at this output, I read this as the curl command not completing within 5 seconds and getting killed, and the main container process isn't set up to handle a case where it gains responsibility for a child process it didn't start itself.

    I suspect that there are two things you can do to fix this:

    1. Fix your application so that it promptly answers the health check. It shouldn't be in a state where it accepts a connection but then waits for longer than 5 seconds to answer it.
    2. Start your container with init: true.

    What I think is going on depends on some very specific details of how Linux (Unix) processes work. The CMD-SHELL health check is injected by Docker as an additional process in the container process namespace, in the same way as docker exec. But more specifically, there are two processes: a wrapper sh process running the command pipeline, and the curl command as a subprocess.

    /bin/sh -c 'curl --fail http://localhost:8080 || exit 1'
    +-- curl --fail http://localhost:8080
    

    When you reach the timeout, Docker goes to terminate the process. It doesn't specifically know about the subprocess, though, so it sends a signal to the sh process. If it still doesn't terminate, Docker sends it SIGKILL, the signal that kill -9 sends, and the shell process ceases to exist. SIGKILL can't be caught, so the shell gets no chance to pass it on to curl.

    What happens to the curl process? Its parent used to be the sh process, but that's gone. The standard Unix rules here are that it gets moved to be a child of the "init" process, with process ID 1. In a Docker context, the main container process (your ENTRYPOINT if you have one, your CMD if not) is that process.

    Eventually the curl process will complete, maybe with its own timeout. The standard Unix rules here are that most of the process cleans up but its process table entry stays around, and its parent process can wait(2) for it and find out its status code. A process that's exited but hasn't been waited for is a "zombie" process; that's your long listing of process entries that have Z in the status column and <defunct> at the end of the line. It's worth noting that these aren't using memory or file handles or other resources, only process table entries.
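
    One way to make that "own timeout" explicit is to give curl a --max-time shorter than Docker's timeout, so that curl exits by itself while the sh wrapper is still around to collect it. This is only a sketch, not something the output above proves you need: the 4-second value is an assumption chosen to fit under the 5s timeout in the question.

    ```yaml
    services:
      tomcat:
        # ... rest of the service definition as in the question
        healthcheck:
          test:
            - CMD-SHELL
            # --max-time 4 is an assumed value: curl gives up before Docker's 5s timeout
            - curl --fail --max-time 4 http://localhost:8080 || exit 1
          interval: 5s
          timeout: 5s
          retries: 30
    ```

    This only keeps the health check from leaving orphans behind; it doesn't make the application answer any faster.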

    The combination of these things adds up to: the main process in a container is process ID 1; process ID 1 is expected to be the "init" process; process ID 1 can sometimes unexpectedly get children attached to it that it didn't start itself, and it needs to clean up after them ("reap zombies"). If your main process is the Tomcat server, and it's not watching for additional children (or the SIGCHLD signal), then you'll leak processes in the way you're seeing.

    The Compose init: true option wraps the main container command in a lightweight init process, by default Tini. If process 1 needs to reap zombies, Tini does that, and it handles some cases around signals. That's pretty much all it does, but it's an important function.
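
    In the Compose file from the question, that is a single extra key on the service. A sketch, with the rest of the service definition left out as in the question:

    ```yaml
    services:
      tomcat:
        # ... rest of the service definition as in the question
        init: true   # run an init process (Tini by default) as PID 1 to reap orphaned children
        healthcheck:
          test:
            - CMD-SHELL
            - curl --fail http://localhost:8080 || exit 1
          interval: 5s
          timeout: 5s
          retries: 30
    ```

    If you start the container with plain docker run instead of Compose, the equivalent is the --init flag.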