
Python app in Docker container doesn't stop/remove Docker container when app fails


I've got a Python application that polls a queue for new data, and inserts it into a TimescaleDB database (TimescaleDB is an extension for PostgreSQL). This application must stay running at all times.

The problem is, the Python program may fail from time to time, and I expect Docker Swarm to restart the container. However, the containers keep running even after failure. Why aren't my containers failing and then being restarted by Docker Swarm?

The Python app looks something like this:

def main():
    try:
        conn = get_db_conn()
        insert_data(conn)
    except Exception:
        logger.exception("Error with main inserter.py function")
        send_email_if_error()
        raise
    finally:
        try:
            conn.close()
            del conn
        except Exception:
            pass

        return 0


if __name__ == "__main__":
    main()

The Dockerfile looks like this:

FROM python:3.8-slim-buster

# Configure apt and install packages
RUN apt-get update && \
    apt-get -y --no-install-recommends install cron nano procps

# Install Python requirements.
RUN pip3 install --upgrade pip && \
    pip3 install poetry==1.0.10

COPY poetry.lock pyproject.toml /
RUN poetry config virtualenvs.create false && \
  poetry install --no-interaction --no-ansi

# Copy everything to the / folder inside the container
COPY . /

# Make /var/log the default directory in the container
WORKDIR /var/log

# Start Python app on container startup
CMD ["python3", "/inserter/inserter.py"]
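Worth keeping in mind for what follows: Swarm's restart_policy watches the exit code of the process started by CMD. An unhandled Python exception exits non-zero; a clean return (or a swallowed error) exits 0. A quick sketch of that behaviour:

```python
import subprocess
import sys

# An unhandled exception makes the interpreter exit with code 1 --
# the non-zero signal a restart policy like "on-failure" looks for.
crash = subprocess.run(
    [sys.executable, "-c", "raise RuntimeError('boom')"],
    stderr=subprocess.DEVNULL,
)

# A script that finishes normally exits 0, and Swarm treats it as success.
clean = subprocess.run([sys.executable, "-c", "pass"])

print(crash.returncode, clean.returncode)  # 1 0
```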

The docker-compose file (docker-compose.prod13.yml):

version: '3.7'
services:
  inserter13:
    # Name and tag of image the Dockerfile creates
    image: mccarthysean/ijack:timescale
    depends_on: 
      - timescale13
    env_file: .env
    environment: 
      POSTGRES_HOST: timescale13
    networks:
      - traefik-public
    deploy:
      # Either global (exactly one container per physical node) or
      # replicated (a specified number of containers). The default is replicated
      mode: replicated
      # For stateless applications using "replicated" mode,
      # the total number of replicas to create
      replicas: 2
      restart_policy:
        condition: on-failure # default is 'any'

  timescale13:
    image: timescale/timescaledb:2.3.0-pg13
    volumes: 
      - type: volume
        source: ijack-timescale-db-pg13
        target: /var/lib/postgresql/data # the location in the container where the data are stored
        read_only: false
      # Custom postgresql.conf file will be mounted (see command: as well)
      - type: bind
        source: ./postgresql_custom.conf
        target: /postgresql_custom.conf
        read_only: false
    env_file: .env
    command: ["-c", "config_file=/postgresql_custom.conf"]
    ports:
      - 0.0.0.0:5432:5432
    networks:
      traefik-public:
    deploy:
      # Either global (exactly one container per physical node) or
      # replicated (a specified number of containers). The default is replicated
      mode: replicated
      # For stateless applications using "replicated" mode,
      # the total number of replicas to create
      replicas: 1
      placement:
        constraints:
          # Since this is for the stateful database,
          # only run it on the swarm manager, not on workers
          - "node.role==manager"
      restart_policy:
        condition: on-failure # default is 'any'


# Use a named external volume to persist our data
volumes:
  ijack-timescale-db-pg13:
    external: true

networks:
  # Use the previously created public network "traefik-public", shared with other
  # services that need to be publicly available via this Traefik
  traefik-public:
    external: true

The "docker-compose.build.yml" file I use for building the "inserter.py" container's image:

version: '3.7'
services:
  inserter:
    # Name and tag of image the Dockerfile creates
    image: mccarthysean/ijack:timescale
    build:
      # context: where should docker-compose look for the Dockerfile?
      # i.e. either a path to a directory containing a Dockerfile, or a url to a git repository
      context: .
      dockerfile: Dockerfile.inserter
    environment: 
      POSTGRES_HOST: timescale

Bash script I run, which builds, pushes, and deploys the database and the inserter containers with Docker Swarm:

#!/bin/bash

# Build and tag image locally in one step. 
# No need for docker tag <image> mccarthysean/ijack:<tag>
echo ""
echo "Building the image locally..."
echo "docker-compose -f docker-compose.build.yml build"
docker-compose -f docker-compose.build.yml build

# Push to Docker Hub
# docker login --username=mccarthysean
echo ""
echo "Pushing the image to Docker Hub..."
echo "docker push mccarthysean/ijack:timescale"
docker push mccarthysean/ijack:timescale

# Deploy to the Docker swarm and send login credentials 
# to other nodes in the swarm with "--with-registry-auth"
echo ""
echo "Deploying to the Docker swarm..."
echo "docker stack deploy --with-registry-auth -c docker-compose.prod13.yml timescale13"
docker stack deploy --with-registry-auth -c docker-compose.prod13.yml timescale13

When the Python inserter program fails (a database connection issue, say, or something else), it sends me an email alert, re-raises the error, and exits. At that point I expect the container to stop and be restarted per Docker Swarm's restart_policy with condition: on-failure. However, after an error, docker service ls shows the following 0/2 replicas:

ID                  NAME                                        MODE                REPLICAS            IMAGE                                         PORTS
u354h0uj4ug6        timescale13_inserter13                      replicated          0/2                 mccarthysean/ijack:timescale
o0rbfx5n2z4h        timescale13_timescale13                     replicated          1/1                 timescale/timescaledb:2.3.0-pg13              *:5432->5432/tcp

When it's healthy (most of the time), it shows 2/2 replicas. Why aren't my containers failing and then being restarted by Docker Swarm?


Solution

  • I figured it out, and updated my question to provide more details on my try: except: failure routine.

    Here's the error that happened (actually two errors in sequence, as you'll see):

    Here's the error information: 
    Traceback (most recent call last):
      File "/inserter/inserter.py", line 357, in execute_sql
        cursor.execute(sql, values)
    psycopg2.errors.AdminShutdown: terminating connection due to administrator command SSL connection has been closed unexpectedly
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/inserter/inserter.py", line 911, in main
        insert_alarm_log_rds(
      File "/inserter/inserter.py", line 620, in insert_alarm_log_rds
        rc = execute_sql(
      File "/inserter/inserter.py", line 364, in execute_sql
        conn.rollback()
    psycopg2.InterfaceError: connection already closed
    

    As you can see, there was first a psycopg2.errors.AdminShutdown error, which my first try: except: routine caught and re-raised. However, a second psycopg2.InterfaceError then occurred in my finally: cleanup code, where it was swallowed by the bare pass. The return 0 that followed discarded the in-flight exception entirely (a return inside finally: always does), so the process exited with code 0 instead of the non-zero code needed to trigger the restart.
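    This failure mode reproduces without any database at all; here is a minimal sketch of the same control flow, with plain RuntimeErrors standing in for the psycopg2 errors:

    ```python
    def flawed():
        try:
            # Stands in for the original psycopg2.errors.AdminShutdown.
            raise RuntimeError("simulated insert error")
        finally:
            try:
                # Stands in for conn.rollback()/conn.close() failing
                # with psycopg2.InterfaceError.
                raise RuntimeError("simulated cleanup error")
            except Exception:
                pass  # second error swallowed here
            # A return inside finally: discards the in-flight exception.
            return 0

    print(flawed())  # 0 -- no exception escapes, so the process exits 0
    ```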

    @edijon's comment about needing a non-zero exit code was what helped me to figure this out.

    I needed to re-raise the error in the finally: routine, as follows:

    def main():
        try:
            conn = get_db_conn()
            insert_data(conn)
        except Exception:
            logger.exception("Error with main inserter.py function")
            send_email_if_error()
            raise
        finally:
            try:
                conn.close()
                del conn
            except Exception:
                # previously the following was just 'pass' 
                # and I changed it to 'raise' to ensure errors
                # cause a non-zero error code for Docker's 'restart_policy'
                raise
    
            # Previously there was a "return 0" here. A return inside
            # finally: swallows any in-flight exception, so the process
            # exited 0 even after a failure. Removing it lets the
            # re-raised exception propagate as a non-zero exit code.
    
    
    if __name__ == "__main__":
        main()
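    One related subtlety: even a non-zero return value from main() would not have helped here, because the bare main() call at the bottom discards it and the process still exits 0. Only an uncaught exception, or wrapping the call in sys.exit(main()), produces a non-zero exit code. A small sketch:

    ```python
    import subprocess
    import sys

    # main() returning 1 is ignored when called bare: the process exits 0.
    bare = subprocess.run(
        [sys.executable, "-c", "def main():\n    return 1\nmain()"])

    # sys.exit(main()) turns the return value into the process exit code.
    wrapped = subprocess.run(
        [sys.executable,
         "-c", "import sys\ndef main():\n    return 1\nsys.exit(main())"])

    print(bare.returncode, wrapped.returncode)  # 0 1
    ```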