I'm trying to build a Dockerfile to use Airflow and Spark, as follows:
FROM apache/airflow:2.7.0-python3.9
ENV AIRFLOW_HOME=/opt/airflow
USER root
# Update the package list, install required packages, and clean up
RUN apt-get update && \
apt-get install -y gcc python3-dev openjdk-11-jdk wget && \
apt-get clean
# Set the JAVA_HOME environment variable
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
COPY requirements.txt .
USER airflow
RUN pip install -U pip
RUN pip install --no-cache-dir -r requirements.txt
My requirements.txt is
apache-airflow
apache-airflow-providers-apache-spark
apache-airflow-providers-celery>=3.3.0
apache-airflow-providers-google
pandas
psycopg2-binary
pytest
pyspark
requests
sqlalchemy
The build takes an extremely long time, and I keep getting output like this:
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime.
=> => # Downloading google_cloud_workflows-1.16.0-py2.py3-none-any.whl.metadata (5.2 kB)
And if I remove the python3.9 suffix from the base image tag in the first line, then I'm unable to install openjdk-11-jdk.
Does anyone know how to solve this? Thank you.
Try using Airflow's official constraints file. It contains a pre-computed, mutually compatible set of dependency versions for each Airflow release, which drastically reduces the backtracking pip's resolver has to do on its own.
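The constraint URL follows a fixed pattern documented by Airflow, so pick the file matching the Airflow version and Python version of your base image:

https://raw.githubusercontent.com/apache/airflow/constraints-<AIRFLOW_VERSION>/constraints-<PYTHON_VERSION>.txt

Updated Dockerfile: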
FROM apache/airflow:2.7.0-python3.9
ENV AIRFLOW_HOME=/opt/airflow
USER root
# Update the package list, install required packages, and clean up
RUN apt-get update && \
apt-get install -y gcc python3-dev openjdk-11-jdk wget && \
apt-get clean
# Set the JAVA_HOME environment variable
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
COPY requirements.txt .
USER airflow
RUN pip install --upgrade pip
# Use pip's constraint mode to avoid backtracking
RUN pip install --no-cache-dir --use-pep517 --constraint=https://raw.githubusercontent.com/apache/airflow/constraints-2.7.0/constraints-3.9.txt -r requirements.txt
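If you later bump Airflow or Python, the image tag and the hard-coded constraint URL can silently drift apart. A minimal sketch (build-arg names are my own) that keeps them in sync with Docker build args:

# Declare before FROM so the args can parameterize the base image tag
ARG AIRFLOW_VERSION=2.7.0
ARG PYTHON_VERSION=3.9
FROM apache/airflow:${AIRFLOW_VERSION}-python${PYTHON_VERSION}
# Re-declare after FROM so the args are visible in this build stage
ARG AIRFLOW_VERSION
ARG PYTHON_VERSION
# (the root/apt/JAVA_HOME steps from above go here, unchanged)
COPY requirements.txt .
USER airflow
# The constraint URL is now derived from the same args as the base image
RUN pip install --no-cache-dir \
    --constraint="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt" \
    -r requirements.txt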
The requirements.txt:
apache-airflow==2.7.0  # pin to the version baked into the base image
apache-airflow-providers-apache-spark
apache-airflow-providers-celery>=3.3.0
# leave the Google provider unpinned: a pin that disagrees with the constraints file makes pip fail
apache-airflow-providers-google
pandas
psycopg2-binary
pytest
pyspark
requests
sqlalchemy
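Then build and run a quick sanity check (the image tag airflow-spark is arbitrary; bash and python are passthrough commands supported by the official Airflow image's entrypoint):

docker build -t airflow-spark .
docker run --rm airflow-spark bash -c "java -version && python -c 'import pyspark; print(pyspark.__version__)'"

With the constraints file in place, pip no longer has to explore the whole version space for the providers, so the resolver warnings should disappear and the build should finish in a few minutes instead of hanging on dependency resolution.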