Tags: python, bash, docker, shell, cron

Cron job for Scrapy project within Docker container fails to connect to PostgreSQL database in a different Docker container


So two of my Docker containers for this project are a Python container running a Scrapy project and a Postgres container.

docker-compose.yml

version: '3.8'
services:
  app:
    container_name: app
    build:
      context: ./app
      dockerfile: dockerfile
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
      - POSTGRES_HOST=${POSTGRES_HOST}
      - POSTGRES_PORT=${POSTGRES_PORT}
      - MAILTO=${MAILTO}
    depends_on:
      - db

  db:
    container_name: db
    build: 
      context: ./db
      dockerfile: dockerfile
      args:
        POSTGRES_USER: ${POSTGRES_USER}
        POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
        POSTGRES_DB: ${POSTGRES_DB}
    ports:
      - "${POSTGRES_PORT}:${POSTGRES_PORT}"

  admin:
    container_name: admin
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=${PGADMIN_DEFAULT_EMAIL}
      - PGADMIN_DEFAULT_PASSWORD=${PGADMIN_DEFAULT_PASSWORD}
    ports:
      - "8888:80"
    depends_on:
      - db

  visualizer:
    container_name: visualizer
    image: grafana/grafana
    ports:
      - "3000:3000"
    depends_on:
      - db

dockerfile for app

FROM python:3.10-bookworm
RUN apt-get update -q 
RUN apt-get install -y cron
COPY . .
RUN pip3 install -r requirements.txt
COPY shell_scripts/scrape_cron /etc/cron.d/scrape_cron
RUN chmod 0744 /etc/cron.d/scrape_cron
RUN crontab /etc/cron.d/scrape_cron 
RUN touch /var/log/cron.log
CMD cron && tail -f /var/log/cron.log

dockerfile for db

FROM postgres:15.0
USER postgres
ARG POSTGRES_USER
ARG POSTGRES_PASSWORD
ARG POSTGRES_DB
ENV POSTGRES_USER=$POSTGRES_USER
ENV POSTGRES_PASSWORD=$POSTGRES_PASSWORD
ENV POSTGRES_DB=$POSTGRES_DB
RUN pg_createcluster 15 main && \
    /etc/init.d/postgresql start && \
    psql --command "CREATE ROLE $POSTGRES_USER WITH SUPERUSER PASSWORD '$POSTGRES_PASSWORD';" && \
    createdb -O $POSTGRES_USER $POSTGRES_DB
EXPOSE 5432
CMD ["postgres"]

The Scrapy project within the app container connects to the database in the db container through a standard psycopg connection.

pipeline.py

# Module-level imports used by this fragment
import os
import logging

import psycopg

# Inside the pipeline's __init__:
hostname = os.environ.get('POSTGRES_HOST', "Hostname not found")
username = os.environ.get('POSTGRES_USER', "Username not found")
password = os.environ.get('POSTGRES_PASSWORD', "Password not found")
database = os.environ.get('POSTGRES_DB', "Database name not found")
port = os.environ.get('POSTGRES_PORT', "Port not found")

logging.debug("Connecting to database...")

try:
    self.connection = psycopg.connect(host=hostname, user=username, password=password, dbname=database, port=port)
    self.cursor = self.connection.cursor()
    logging.info("Connected to database.")
except psycopg.Error:
    logging.error("Could not connect to database.")
    raise

The issue occurs with the crontab I implemented to automate the project.

cron

30 5 * * 0 sh /shell_scripts/scrape.sh 
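Since cron's output is not attached to the container's terminal, it helps to append the job's stdout and stderr to the log file the container is already tailing; a sketch of the same entry with redirection added:

30 5 * * 0 sh /shell_scripts/scrape.sh >> /var/log/cron.log 2>&1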

scrape.sh

#!/bin/bash
export PATH=$PATH:/usr/local/bin
export POSTGRES_USER=$POSTGRES_USER
export POSTGRES_PASSWORD=$POSTGRES_PASSWORD
export POSTGRES_DB=$POSTGRES_DB
export POSTGRES_HOST=$POSTGRES_HOST
export POSTGRES_PORT=$POSTGRES_PORT
cd "/scrape"
scrapy crawl spider
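A quick way to confirm what cron actually sees is a temporary job that dumps its environment to a file (the schedule and path here are only illustrative):

* * * * * env > /tmp/cron_env.txt

Comparing that file with the output of env in an interactive shell shows that cron jobs receive only a minimal environment (HOME, PATH, SHELL, LOGNAME), so none of the POSTGRES_* variables injected by docker-compose are present, and the exports above expand to empty strings.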

It took me a while to get this far with the cron setup. When the job fires, the shell script executes successfully, but my Scrapy program fails to establish the database connection with the following message:

CRITICAL: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/usr/local/lib/python3.10/site-packages/scrapy/crawler.py", line 134, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python3.10/site-packages/scrapy/crawler.py", line 148, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python3.10/site-packages/scrapy/core/engine.py", line 99, in __init__
    self.scraper = Scraper(crawler)
  File "/usr/local/lib/python3.10/site-packages/scrapy/core/scraper.py", line 109, in __init__
    self.itemproc: ItemPipelineManager = itemproc_cls.from_crawler(crawler)
  File "/usr/local/lib/python3.10/site-packages/scrapy/middleware.py", line 67, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python3.10/site-packages/scrapy/middleware.py", line 44, in from_settings
    mw = create_instance(mwcls, settings, crawler)
  File "/usr/local/lib/python3.10/site-packages/scrapy/utils/misc.py", line 194, in create_instance
    instance = objcls(*args, **kwargs)
  File "/scrape/scrape/pipelines.py", line 26, in __init__
    self.connection = psycopg.connect(host=hostname, user=username, password=password, dbname=database, port=port)
  File "/usr/local/lib/python3.10/site-packages/psycopg/connection.py", line 738, in connect
    raise ex.with_traceback(None)
psycopg.OperationalError: connection is bad: No such file or directory
    Is the server running locally and accepting connections on that socket?

I believe this issue is caused by cron processing the job in its own minimal environment. Everything runs successfully when I run the shell script manually through the terminal. Under cron, though, the POSTGRES_* variables are unset, so the program never receives the hostname of the db service on the Docker network and instead goes looking for something local.
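That reading matches the traceback: when libpq is handed an empty host, it falls back to the local Unix-domain socket, which is exactly the "No such file or directory ... accepting connections on that socket?" error above. The failure can likely be reproduced without waiting for cron by clearing the environment by hand (a sketch; the container name and paths follow the compose file above):

docker exec app env -i PATH=/usr/local/bin:/usr/bin:/bin sh -c 'cd /scrape && scrapy crawl spider'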

I'm not sure how to solve this issue. I knew using cron within Docker was tricky, but this has been kinda nightmarish.

Values for the environment variables (an illustrative .env file follows the list):

  • POSTGRES_USER = username
  • POSTGRES_PASSWORD = password for the user
  • POSTGRES_DB = name of the database
  • POSTGRES_HOST = db (Using the name of the service for the hostname)
  • POSTGRES_PORT = 5432
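These are typically supplied through an .env file next to docker-compose.yml; a sketch with placeholder values standing in for the real ones:

POSTGRES_USER=username
POSTGRES_PASSWORD=password
POSTGRES_DB=scrapedb
POSTGRES_HOST=db
POSTGRES_PORT=5432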

Solution

  • Finally solved the problem. Because cron runs jobs in its own environment, I had to bake the values of the environment variables directly into the shell script at build time, passing them in as build arguments through the following changes:

    docker-compose.yml

    version: '3.8'
    services:
      app:
        container_name: app
        build:
          context: ./app
          dockerfile: dockerfile
          args:
            - POSTGRES_USER=${POSTGRES_USER}
            - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
            - POSTGRES_DB=${POSTGRES_DB}
            - POSTGRES_HOST=${POSTGRES_HOST}
            - POSTGRES_PORT=${POSTGRES_PORT}
        environment:
          - POSTGRES_USER=${POSTGRES_USER}
          - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
          - POSTGRES_DB=${POSTGRES_DB}
          - POSTGRES_HOST=${POSTGRES_HOST}
          - POSTGRES_PORT=${POSTGRES_PORT}
        depends_on:
          - db
        links:
          - db
    

    dockerfile for app

    FROM python:3.10-bookworm
    RUN apt-get update -q 
    RUN apt-get install -y cron
    COPY . .
    ARG POSTGRES_USER
    ARG POSTGRES_PASSWORD
    ARG POSTGRES_DB
    ARG POSTGRES_HOST
    ARG POSTGRES_PORT
    RUN sed -i "s/\$POSTGRES_USER/${POSTGRES_USER}/g" shell_scripts/scrape.sh
    RUN sed -i "s/\$POSTGRES_PASSWORD/${POSTGRES_PASSWORD}/g" shell_scripts/scrape.sh
    RUN sed -i "s/\$POSTGRES_DB/${POSTGRES_DB}/g" shell_scripts/scrape.sh
    RUN sed -i "s/\$POSTGRES_HOST/${POSTGRES_HOST}/g" shell_scripts/scrape.sh
    RUN sed -i "s/\$POSTGRES_PORT/${POSTGRES_PORT}/g" shell_scripts/scrape.sh
    RUN pip3 install -r requirements.txt
    COPY shell_scripts/scrape_cron /etc/cron.d/scrape_cron
    RUN chmod 0744 /etc/cron.d/scrape_cron
    RUN crontab /etc/cron.d/scrape_cron 
    RUN touch /var/log/cron.log
    CMD cron && tail -f /var/log/cron.log
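
    After these sed substitutions run at build time, the copy of scrape.sh baked into the image contains literal values instead of variable references, so cron's stripped-down environment no longer matters. With the placeholder values above, it would end up looking like this:

    #!/bin/bash
    export PATH=$PATH:/usr/local/bin
    export POSTGRES_USER=username
    export POSTGRES_PASSWORD=password
    export POSTGRES_DB=scrapedb
    export POSTGRES_HOST=db
    export POSTGRES_PORT=5432
    cd "/scrape"
    scrapy crawl spider

    One trade-off worth noting: the credentials are now stored in the image layers, so the image shouldn't be pushed to a public registry.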