Tags: rhel, podman, rhel8

Podman pod disappears after a few days, but the process is still running and listening on a given port


I am running an Elasticsearch container as a Podman pod using podman play kube and a YAML pod definition. The pod is created, the three-node cluster forms, and everything works as expected. But the Podman pod dies after a few days of sitting idle.

The podman pod ps command says:

ERRO[0000] Error refreshing container af05fafe31f6bfb00c2599255c47e35813ecf5af9bbe6760ae8a4abffd343627: error acquiring lock 1 for container af05fafe31f6bfb00c2599255c47e35813ecf5af9bbe6760ae8a4abffd343627: file exists
ERRO[0000] Error refreshing container b4620633d99f156bb59eb327a918220d67145f8198d1c42b90d81e6cc29cbd6b: error acquiring lock 2 for container b4620633d99f156bb59eb327a918220d67145f8198d1c42b90d81e6cc29cbd6b: file exists
ERRO[0000] Error refreshing pod 389b0c34313d9b23ecea3faa0e494e28413bd15566d66297efa9b5065e025262: error retrieving lock 0 for pod 389b0c34313d9b23ecea3faa0e494e28413bd15566d66297efa9b5065e025262: file exists
POD ID        NAME               STATUS   CREATED     INFRA ID      # OF CONTAINERS
389b0c34313d  elasticsearch-pod  Created  1 week ago  af05fafe31f6  2
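
For readers hitting the same "file exists" lock errors: these usually mean Podman's lock state went stale. A recovery sketch (not part of the original question; podman system renumber reassigns lock numbers and has existed since Podman 1.0):

```shell
# Recovery sketch for "error acquiring lock ...: file exists".
# podman system renumber reassigns lock numbers for all containers and pods,
# clearing stale allocations left behind by removed temporary state.
if command -v podman >/dev/null 2>&1; then
    podman system renumber && podman pod ps \
        || echo "renumber failed; run as the user that owns the pod"
else
    echo "podman not installed; run this on the affected host"
fi
```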

What's weird is that something is still listening; looking up the process that holds ports 9200 and 9300 shows:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 :::9200                 :::*                    LISTEN      1328607/containers-
tcp6       0      0 :::9300                 :::*                    LISTEN      1328607/containers-

The process that is hanging (and keeping these ports open) is:

user+ 1339220  0.0  0.1  45452  8284 ?        S    Jan11   2:19 /bin/slirp4netns --disable-host-loopback --mtu 65520 --enable-sandbox --enable-seccomp -c -e 3 -r 4 --netns-type=path /tmp/run-1002/netns/cni-e4bb2146-d04e-c3f1-9207-380a234efa1f tap0
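
If the pod itself is already gone, that leftover slirp4netns process can be terminated to free the ports. A cleanup sketch (assumes the matching process belongs to the dead pod's network, as in the output above):

```shell
# Find the stale slirp4netns by its command line (as shown above) and TERM it.
pid=$(pgrep -f 'slirp4netns --disable-host-loopback' | head -n 1)
if [ -n "$pid" ]; then
    kill "$pid" || echo "could not signal $pid (check process ownership)"
else
    echo "no leftover slirp4netns process found"
fi
```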

The only actions I perform on the pod are routine: podman pod stop, podman pod rm, and podman play kube to start it again.
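Spelled out, that routine cycle looks like this (the pod name comes from the listing above; elasticsearch.yaml is a stand-in for the actual definition file, which the post does not name):

```shell
# Routine stop/remove/recreate cycle for the pod. Only runs if the pod exists.
if command -v podman >/dev/null 2>&1 && podman pod exists elasticsearch-pod; then
    podman pod stop elasticsearch-pod
    podman pod rm elasticsearch-pod
    podman play kube elasticsearch.yaml \
        || echo "recreate failed; check the YAML path"
else
    echo "pod not present on this host"
fi
```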

What could be causing this strange Podman behaviour? Why are the locks not released properly?

System information:

NAME="Red Hat Enterprise Linux"
VERSION="8.3 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.3"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.3 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.3:GA"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.3
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.3"
Red Hat Enterprise Linux release 8.3 (Ootpa)

Podman version:

podman --version
podman version 2.2.1

Solution

  • The workaround that worked for me is to add this configuration file from the Podman repository [1] under /usr/lib/tmpfiles.d/ and /etc/tmpfiles.d/; this prevents systemd-tmpfiles from removing Podman's temporary files from the /tmp directory [2]. Additionally, as stated in [3], CNI leaves network information behind in /var/lib/cni/networks when the system crashes or containers do not shut down properly. This affects rootless Podman and has been fixed in a later Podman release [4].

    Workaround

    First, check the default runRoot directory set for your rootless Podman user:

    podman info | grep runRoot
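
    Alternatively, Podman can print just that field; a sketch assuming the Podman 2.x info layout, where runRoot sits under the store section:

```shell
# Print only the runRoot path (the Go template path assumes Podman 2.x's
# info layout; adjust if your version structures the output differently).
if command -v podman >/dev/null 2>&1; then
    podman info --format '{{.Store.RunRoot}}'
else
    echo "podman not installed; run this as the rootless user"
fi
```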
    

    Create the temporary configuration file:

    sudo vim /usr/lib/tmpfiles.d/podman.conf
    

    Add the following content, replacing /tmp/podman-run-* with your default runRoot directory. For example, if your output is /tmp/run-6695/containers, then use: x /tmp/run-*

    # /tmp/podman-run-* directories can contain content for Podman containers that
    # have run for many days. The following lines prevent systemd from removing it.
    x /tmp/podman-run-*
    x /tmp/containers-user-*
    D! /run/podman 0700 root root
    D! /var/lib/cni/networks
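
    As a rough illustration of which paths those x lines protect (shell case patterns only approximate tmpfiles.d globbing here, and the sample paths are made up):

```shell
# Illustrative only: classify sample paths the way the x-lines above would.
for p in /tmp/run-6695/containers /tmp/podman-run-1002 /tmp/unrelated-file; do
    case "$p" in
        /tmp/run-*|/tmp/podman-run-*|/tmp/containers-user-*)
            echo "protected from cleanup: $p" ;;
        *)
            echo "still subject to cleanup: $p" ;;
    esac
done
```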
    

    Copy the temporary file from /usr/lib/tmpfiles.d/ to /etc/tmpfiles.d/:

    sudo cp -p /usr/lib/tmpfiles.d/podman.conf /etc/tmpfiles.d/
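
    To confirm the new exclusions are picked up, systemd-tmpfiles can print its merged configuration (--cat-config is available in recent systemd versions, including the one shipped with RHEL 8):

```shell
# Show the merged tmpfiles configuration and look for the new exclusions.
if command -v systemd-tmpfiles >/dev/null 2>&1; then
    systemd-tmpfiles --cat-config | grep 'podman-run' \
        || echo "exclusions not found; check the file contents"
else
    echo "systemd-tmpfiles not available on this system"
fi
```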
    

    After you have done all the steps according to your configuration, the error should disappear.

    References

    1. https://github.com/containers/podman/blob/master/contrib/tmpfile/podman.conf
    2. https://bugzilla.redhat.com/show_bug.cgi?id=1888988#c9
    3. https://github.com/containers/podman/commit/2e0a9c453b03d2a372a3ab03b9720237e93a067c
    4. https://github.com/containers/podman/pull/8241