I've got a Python application that polls a queue for new data and inserts it into a TimescaleDB database (TimescaleDB is a time-series extension for PostgreSQL). This application must stay running at all times.
The problem is that the Python program fails from time to time, and when it does, I expect Docker Swarm to restart the container. However, after a failure the containers simply stop and are never restarted. Why aren't my containers failing and then being restarted by Docker Swarm?
The Python app looks something like this:
def main():
    try:
        conn = get_db_conn()
        insert_data(conn)
    except Exception:
        logger.exception("Error with main inserter.py function")
        send_email_if_error()
        raise
    finally:
        try:
            conn.close()
            del conn
        except Exception:
            pass
        return 0

if __name__ == "__main__":
    main()
The Dockerfile looks like this:
FROM python:3.8-slim-buster

# Configure apt and install packages
RUN apt-get update && \
    apt-get -y --no-install-recommends install cron nano procps

# Install Python requirements
RUN pip3 install --upgrade pip && \
    pip3 install poetry==1.0.10

COPY poetry.lock pyproject.toml /
RUN poetry config virtualenvs.create false && \
    poetry install --no-interaction --no-ansi

# Copy everything to the / folder inside the container
COPY . /

# Make /var/log the default directory in the container
WORKDIR /var/log

# Start the Python app on container startup
CMD ["python3", "/inserter/inserter.py"]
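One detail that matters later: since the CMD runs python3 directly, the Python process is PID 1 inside the container, and its exit status becomes the container's exit code, which is what Docker Swarm's restart_policy evaluates.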
The production docker-compose file ("docker-compose.prod13.yml", deployed by the script below):
version: '3.7'

services:

  inserter13:
    # Name and tag of the image the Dockerfile creates
    image: mccarthysean/ijack:timescale
    depends_on:
      - timescale13
    env_file: .env
    environment:
      POSTGRES_HOST: timescale13
    networks:
      - traefik-public
    deploy:
      # Either global (exactly one container per physical node) or
      # replicated (a specified number of containers). The default is replicated
      mode: replicated
      # For stateless applications using "replicated" mode,
      # the total number of replicas to create
      replicas: 2
      restart_policy:
        condition: on-failure # default is 'any'

  timescale13:
    image: timescale/timescaledb:2.3.0-pg13
    volumes:
      - type: volume
        source: ijack-timescale-db-pg13
        target: /var/lib/postgresql/data # the location in the container where the data are stored
        read_only: false
      # Custom postgresql.conf file will be mounted (see command: as well)
      - type: bind
        source: ./postgresql_custom.conf
        target: /postgresql_custom.conf
        read_only: false
    env_file: .env
    command: ["-c", "config_file=/postgresql_custom.conf"]
    ports:
      - 0.0.0.0:5432:5432
    networks:
      traefik-public:
    deploy:
      # Either global (exactly one container per physical node) or
      # replicated (a specified number of containers). The default is replicated
      mode: replicated
      # For stateless applications using "replicated" mode,
      # the total number of replicas to create
      replicas: 1
      placement:
        constraints:
          # Since this is for the stateful database,
          # only run it on the swarm manager, not on workers
          - "node.role==manager"
      restart_policy:
        condition: on-failure # default is 'any'

# Use a named external volume to persist our data
volumes:
  ijack-timescale-db-pg13:
    external: true

networks:
  # Use the previously created public network "traefik-public", shared with other
  # services that need to be publicly available via this Traefik
  traefik-public:
    external: true
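Note that condition: on-failure only restarts a task whose container exited with a non-zero code; a task that exits with code 0 is marked Complete and left alone (the default, condition: any, restarts in both cases).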
The "docker-compose.build.yml" file I use for building the "inserter.py" container's image:
version: '3.7'

services:

  inserter:
    # Name and tag of the image the Dockerfile creates
    image: mccarthysean/ijack:timescale
    build:
      # context: where docker-compose should look for the Dockerfile,
      # i.e. either a path to a directory containing a Dockerfile, or a URL to a git repository
      context: .
      dockerfile: Dockerfile.inserter
    environment:
      POSTGRES_HOST: timescale
The Bash script I run, which builds, pushes, and deploys the database and the inserter containers with Docker Swarm:
#!/bin/bash
# Build and tag image locally in one step.
# No need for docker tag <image> mccarthysean/ijack:<tag>
echo ""
echo "Building the image locally..."
echo "docker-compose -f docker-compose.build.yml build"
docker-compose -f docker-compose.build.yml build
# Push to Docker Hub
# docker login --username=mccarthysean
echo ""
echo "Pushing the image to Docker Hub..."
echo "docker push mccarthysean/ijack:timescale"
docker push mccarthysean/ijack:timescale
# Deploy to the Docker swarm and send login credentials
# to other nodes in the swarm with "--with-registry-auth"
echo ""
echo "Deploying to the Docker swarm..."
echo "docker stack deploy --with-registry-auth -c docker-compose.prod13.yml timescale13"
docker stack deploy --with-registry-auth -c docker-compose.prod13.yml timescale13
When the Python inserter program fails (it could be a database connection issue, or something else), it sends me an email alert, then re-raises the error and exits. At this point, I expect the Docker container to fail and be restarted by Docker Swarm's restart_policy: on-failure. However, after an error, when I type docker service ls, I see the following 0/2 replicas:
ID             NAME                      MODE         REPLICAS   IMAGE                              PORTS
u354h0uj4ug6   timescale13_inserter13    replicated   0/2        mccarthysean/ijack:timescale
o0rbfx5n2z4h   timescale13_timescale13   replicated   1/1        timescale/timescaledb:2.3.0-pg13   *:5432->5432/tcp
When it's healthy (most of the time), it shows 2/2 replicas. Why aren't my containers failing and then being restarted by Docker Swarm?
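To dig into why the tasks stopped, docker service ps timescale13_inserter13 --no-trunc lists each task's desired state, current state, and error; tasks that exited with code 0 show up as Complete rather than Failed, so the on-failure policy has nothing to act on.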
I figured it out, and updated my question to provide more details on my try/except failure routine.
Here's the error that occurred (actually two errors in sequence, as you'll see):
Traceback (most recent call last):
  File "/inserter/inserter.py", line 357, in execute_sql
    cursor.execute(sql, values)
psycopg2.errors.AdminShutdown: terminating connection due to administrator command
SSL connection has been closed unexpectedly

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/inserter/inserter.py", line 911, in main
    insert_alarm_log_rds(
  File "/inserter/inserter.py", line 620, in insert_alarm_log_rds
    rc = execute_sql(
  File "/inserter/inserter.py", line 364, in execute_sql
    conn.rollback()
psycopg2.InterfaceError: connection already closed
As you can see, there was first a psycopg2.errors.AdminShutdown error, which was raised in my first try/except routine. However, this was followed by a second psycopg2.InterfaceError that occurred in my finally: cleanup code, where the inner except swallowed it with pass. Execution then hit the return 0 at the end of the finally: block, and a return inside a finally block discards whatever exception is propagating, so the earlier error was never re-raised and the process exited with code 0 instead of the non-zero exit code needed to trigger the restart.
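To make that mechanism concrete, here's a minimal, standalone sketch (hypothetical, not from my app) showing that a return inside a finally block silently discards the in-flight exception:

def swallow():
    try:
        raise RuntimeError("boom")  # stands in for the AdminShutdown error
    finally:
        return 0  # discards the propagating RuntimeError

print(swallow())  # prints 0, no traceback, so the interpreter exits with code 0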
@edijon's comment about needing a non-zero exit code was what helped me to figure this out.
I needed to re-raise the error in the finally: routine, as follows:
import sys

def main():
    try:
        conn = get_db_conn()
        insert_data(conn)
    except Exception:
        logger.exception("Error with main inserter.py function")
        send_email_if_error()
        raise
    finally:
        try:
            conn.close()
            del conn
        except Exception:
            # previously the following was just 'pass',
            # and I changed it to 'raise' to ensure errors
            # cause a non-zero exit code for Docker's 'restart_policy'
            raise
        # The following was previously "return 0",
        # which caused the container not to restart...
        # Either comment it out, or change it to return non-zero
        return 1

if __name__ == "__main__":
    # sys.exit() is required for main()'s return value to become
    # the process exit code; a bare main() call always exits 0
    # unless an exception escapes
    sys.exit(main())
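For what it's worth, an even cleaner structure (just a sketch, reusing the helper names from my app: get_db_conn, insert_data, logger, and send_email_if_error, which are defined elsewhere in inserter.py) avoids putting any control flow inside finally: at all. contextlib.closing guarantees the connection is closed on the way out without swallowing the in-flight exception, and the top level maps any failure to an explicit non-zero exit code:

import sys
from contextlib import closing

def main():
    # closing() calls conn.close() whether or not insert_data raised,
    # without discarding the propagating exception
    with closing(get_db_conn()) as conn:
        insert_data(conn)

if __name__ == "__main__":
    try:
        main()
    except Exception:
        logger.exception("Error with main inserter.py function")
        send_email_if_error()
        sys.exit(1)  # non-zero exit code triggers restart_policy: on-failure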