
Creating a Dockerfile to use Airflow and Spark: pip backtracking runtime issue


I'm trying to build a Dockerfile that uses Airflow and Spark, as follows:

FROM apache/airflow:2.7.0-python3.9

ENV AIRFLOW_HOME=/opt/airflow

USER root

# Update the package list, install required packages, and clean up
RUN apt-get update && \
    apt-get install -y gcc python3-dev openjdk-11-jdk wget && \
    apt-get clean

# Set the JAVA_HOME environment variable
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

COPY requirements.txt .

USER airflow
RUN pip install -U pip
RUN pip install --no-cache-dir -r requirements.txt

My requirements.txt is:

apache-airflow
apache-airflow-providers-apache-spark
apache-airflow-providers-celery>=3.3.0
apache-airflow-providers-google
pandas
psycopg2-binary
pytest
pyspark
requests
sqlalchemy

The build takes an extremely long time, and I keep getting messages like the ones below:

INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime.
 => => #   Downloading google_cloud_workflows-1.16.0-py2.py3-none-any.whl.metadata (5.2 kB)

If I remove the python3.9 suffix from the base image tag in the first line of my Dockerfile, I'm unable to install openjdk-11-jdk.

Does anyone know how to solve this? Thank you.


Solution

  • Try using Airflow's official constraints file. The constraints file contains pre-computed, compatible dependency versions for each Airflow release, which drastically reduces pip's need to resolve dependencies on its own.

    FROM apache/airflow:2.7.0-python3.9
    
    ENV AIRFLOW_HOME=/opt/airflow
    
    USER root
    
    # Update the package list, install required packages, and clean up
    RUN apt-get update && \
        apt-get install -y gcc python3-dev openjdk-11-jdk wget && \
        apt-get clean
    
    # Set the JAVA_HOME environment variable
    ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    
    COPY requirements.txt .
    
    USER airflow
    RUN pip install --upgrade pip
    # Use pip's constraint mode to avoid backtracking
    RUN pip install --no-cache-dir --use-pep517 --constraint=https://raw.githubusercontent.com/apache/airflow/constraints-2.7.0/constraints-3.9.txt -r requirements.txt
    

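    If you later bump the Airflow or Python version, the constraints URL has to change in lockstep with the base image tag. One way to keep them in sync is to parameterize both with build arguments; this is just a sketch, and the ARG names are my own:

    ARG AIRFLOW_VERSION=2.7.0
    ARG PYTHON_VERSION=3.9
    FROM apache/airflow:${AIRFLOW_VERSION}-python${PYTHON_VERSION}

    # ARGs declared before FROM must be re-declared to be usable after it
    ARG AIRFLOW_VERSION
    ARG PYTHON_VERSION

    # (root / apt-get / JAVA_HOME steps as above)

    COPY requirements.txt .

    USER airflow
    RUN pip install --no-cache-dir \
        --constraint="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt" \
        -r requirements.txt
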
    The requirements.txt, with the Google provider pinned to a specific version to further narrow the resolver's search space:

    apache-airflow
    apache-airflow-providers-apache-spark
    apache-airflow-providers-celery>=3.3.0
    apache-airflow-providers-google==10.1.0
    pandas
    psycopg2-binary
    pytest
    pyspark
    requests
    sqlalchemy
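
    To build and smoke-test the image (the airflow-spark tag is just an example name):

    docker build -t airflow-spark .
    # Verify that Java and the Spark provider made it into the image
    docker run --rm airflow-spark bash -c "java -version && airflow providers list | grep -i spark"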