Docker kills all containers after 5 minutes and also fails to get created containers running

For many years, I have been running two websites without any problems, using several Docker containers on a virtual server that was once set up with CoreOS. In all that time, I never encountered a situation that I did not understand.

Until now. Since last week, I have been struggling with phenomena that I can neither understand nor get under control.

Prerequisite

For some reason, I had to restart the machine. The automatic process to start the containers failed. I hadn't changed anything on the machine, so this was unexpected and I had no clue.

I therefore suspended the automatic process to be able to investigate the phenomenon. To begin with, I made sure that the machine at least starts the Docker process itself properly and without any errors:

# systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2024-07-14 19:05:13 CEST; 7s ago
     Docs: https://docs.docker.com
 Main PID: 123469 (dockerd)
    Tasks: 8
   Memory: 80.4M
   CGroup: /system.slice/docker.service
           └─123469 /usr/bin/dockerd -H fd://

Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.067795763+02:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.068075039+02:00" level=warning msg="Your kernel does not support cgroup blkio weight"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.068092922+02:00" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.068447780+02:00" level=info msg="Loading containers: start."
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.278561566+02:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to>
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.370232284+02:00" level=info msg="Loading containers: done."
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.390172816+02:00" level=info msg="Docker daemon" commit=4c52b90 graphdriver(s)=overlay2 version=18.09.1
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.390223822+02:00" level=info msg="Daemon has completed initialization"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.455692794+02:00" level=info msg="API listen on /var/run/docker.sock"
Jul 14 19:05:13 IONOS-1 systemd[1]: Started Docker Application Container Engine.

My investigation of the blkio warnings showed that they can be ignored.

My original stack

When I trigger my start process with docker stack deploy -c /root/external.net/wp/docker-compose.yml wp, all containers appear in the overview with the status created, but none of them changes to the status running, as would be normal:

Creating network wp_back_ntw
Creating service wp_adm
Creating service wp_joe
Creating service wp_wp
Creating service wp_master

Instead, all containers are restarted after a while, and this repeats indefinitely, piling up created containers without any of them ever running. I made sure that none of the containers in my .yml file has a restart instruction, so I am sure I am not triggering the restarts myself.

I first tried to remove the garbage with my universal clear command:

docker ps -a | grep 'ted' | awk '{print $1}' | xargs docker rm -v; docker ps -a | grep 'ead' | awk '{print $1}' | xargs docker rm -v
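For comparison, here is a minimal sketch of the same cleanup using the status filters of docker ps instead of grep. The function name remove_stopped_containers is hypothetical. Repeating --filter status=... selects containers in any of the listed states, which avoids accidentally matching 'ted' or 'ead' in an image or container name.

```shell
# Hypothetical helper: remove all containers in the created, exited or dead
# state, together with their anonymous volumes (-v).
remove_stopped_containers() {
    ids=$(docker ps -aq \
        --filter status=created \
        --filter status=exited \
        --filter status=dead)
    if [ -n "$ids" ]; then
        # Word splitting on $ids is intentional: one argument per container ID.
        docker rm -v $ids
    fi
}

# Usage (requires a running Docker daemon):
#   remove_stopped_containers
```

Note that containers created by docker stack deploy belong to swarm services, so removing them does not remove the services themselves; docker stack rm wp would do that.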

But removing the containers does not stop the restart cycle; it simply starts again. So without further ado, I resorted to a series of commands that I had copied from somewhere else (without understanding the implications) but had used successfully several times before:

systemctl stop docker
rm -rf /var/lib/docker
systemctl start docker

This procedure went fine, as expected.

Stepping back

To isolate the problems and gain more understanding, I switched to using the run command and the usual test routines, which should definitely work as expected:

docker run -d --name loop-demo alpine sh -c "while true; do sleep 1; done"
docker run -d --name sleep-demo alpine sleep infinity
docker run -d --name tail-demo alpine tail -f /dev/null
docker run -dt --name tty-demo alpine

I expected these containers to run indefinitely, but they were reliably terminated by Docker after 5 minutes:

# docker ps -a
CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS                        PORTS               NAMES
5666ba05baf1        alpine              "sh -c 'while true; …"   About a minute ago   Exited (137) 34 seconds ago                       loop-demo
cef06c31d246        alpine              "sleep infinity"         2 minutes ago        Exited (137) 34 seconds ago                       sleep-demo
cd813e81f3c6        alpine              "/bin/sh"                3 minutes ago        Exited (137) 34 seconds ago                       tty-demo
8aa49ec219cd        alpine              "tail -f /dev/null"      5 minutes ago        Exited (137) 33 seconds ago                       tail-demo

This is not expected. Furthermore, the log is incomprehensible to me, for example:

# docker logs cd813e81f3c6
/ #
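As a side note on the listing above: an exit status of 137 follows the 128 + N convention, where N is the number of the signal that ended the process, so 137 corresponds to signal 9 (SIGKILL). A minimal sketch of decoding this, with the hypothetical helper name signal_from_exit_code:

```shell
# Hypothetical helper: translate an exit code into the number of the signal
# that terminated the process, following the 128 + N convention.
# Prints "none" if the code does not indicate termination by a signal.
signal_from_exit_code() {
    code=$1
    if [ "$code" -gt 128 ] && [ "$code" -lt 256 ]; then
        echo $((code - 128))
    else
        echo "none"
    fi
}

# Usage (requires a running Docker daemon; container name from the listing above):
#   code=$(docker inspect --format '{{.State.ExitCode}}' loop-demo)
#   signal_from_exit_code "$code"
#   docker inspect --format '{{.State.OOMKilled}} {{.State.FinishedAt}}' loop-demo
```

A SIGKILL only says that something killed the container from outside; the OOMKilled and FinishedAt fields of docker inspect can help narrow down whether it was the kernel or some other actor.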

I tried the same thing with a container from my stack, with the same result: it only runs for 5 minutes. Well, at least it runs at all and does not stay in the created state forever, in contrast to the deployment as a stack. This is all very unfamiliar and incomprehensible to me. I have finally run out of ideas and humbly ask for help.

Any ideas or insights?

Now my questions are:

Did anybody ever experience this kind of behavior?
What am I doing wrong?
What can I learn from this setup?
How can I further investigate this scenario?
How can I make the whole thing run as reliably as before?
And lastly, how could this happen in the first place?

Thank you for reading and your effort.

Solution

I put a lot of effort into solving the problem and finally managed it: it was simply and solely my fault, and a very stupid one at that.

I should have taken the regular execution every 5 minutes as a hint to look at my cron jobs right away. How come?

On this machine, I had growing problems with disk space shortages, and the machine became increasingly cluttered. I diagnosed Docker as the cause, so I took several measures to reclaim disk space.

One of these measures was a cron job, so in effect I was deleting the containers myself every 5 minutes. Bingo! Congratulations!
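For anyone hunting a similar culprit, here is a minimal sketch of how such an entry could be tracked down. It assumes the job calls docker directly in the crontab line; the helper name docker_cron_lines and the paths in the usage comments are assumptions and not taken from my actual setup.

```shell
# Hypothetical helper: read crontab text on stdin and print every active
# (non-comment) line that mentions docker.
docker_cron_lines() {
    grep -v '^[[:space:]]*#' | grep -i 'docker'
}

# Usage (paths are assumptions; adjust for the system at hand):
#   crontab -l | docker_cron_lines
#   cat /etc/crontab /etc/cron.d/* 2>/dev/null | docker_cron_lines
#   systemctl list-timers --all    # systemd timers can play the same role
```

If the cron job calls a wrapper script instead, the crontab line itself will not mention docker, so the scripts it references need to be checked as well.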

However, by reinstalling I have gained a lot of free space, so this problem should not occur again in the future.

Many thanks to everyone who tried to solve my problem. I take this story as a lesson to look in the right place first.