
Airflow: Package installation of rpy2 to execute RScripts in Airflow


Requirement: Install the rpy2 library, because the code to be orchestrated with Airflow uses it extensively.

Current Dockerfile

FROM ubuntu:latest

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y --no-install-recommends build-essential r-base r-base-core r-cran-randomforest python3.6 python3-pip python3-setuptools python3-dev&& \
   rm -r /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt /app/requirements.txt

RUN pip3 install --upgrade pip==20.0.2 wheel==0.34.2 setuptools==49.6.0

RUN python3 -m pip install rpy2

RUN Rscript -e "install.packages('data.table')"

COPY . /app

Issue: The image build fails with errors in supporting libraries that my own code never references directly.

The Error:

[6/8] RUN python3 -m pip install rpy2:
1.176 Collecting rpy2
1.304   Downloading rpy2-3.5.14.tar.gz (219 kB)
1.422   Installing build dependencies: started
4.186   Installing build dependencies: finished with status 'done'
4.187   Getting requirements to build wheel: started
4.225   Getting requirements to build wheel: finished with status 'error'
4.225   ERROR: Command errored out with exit status 1:
4.225    command: /usr/bin/python3 /usr/local/lib/python3.10/dist-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmpff4u1mul
4.225        cwd: /tmp/pip-install-12iwr626/rpy2
4.225   Complete output (31 lines):
4.225   Traceback (most recent call last):
4.225     File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pep517/_in_process.py", line 257, in <module>
4.225       main()
4.225     File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pep517/_in_process.py", line 240, in main
4.225       json_out['return_val'] = hook(**hook_input['kwargs'])
4.225     File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pep517/_in_process.py", line 85, in get_requires_for_build_wheel
4.225       backend = _build_backend()
4.225     File "/usr/local/lib/python3.10/dist-packages/pip/_vendor/pep517/_in_process.py", line 63, in _build_backend
4.225       obj = import_module(mod_path)
4.225     File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
4.225       return _bootstrap._gcd_import(name[level:], package, level)
4.225     File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
4.225     File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
4.225     File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
4.225     File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
4.225     File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
4.225     File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
4.225     File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
4.225     File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
4.225     File "<frozen importlib._bootstrap_external>", line 883, in exec_module
4.225     File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
4.225     File "/usr/local/lib/python3.10/dist-packages/setuptools/__init__.py", line 10, in <module>
4.225       import distutils.core
4.225     File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
4.225     File "<frozen importlib._bootstrap>", line 1002, in _find_and_load_unlocked
4.225     File "<frozen importlib._bootstrap>", line 945, in _find_spec
4.225     File "/usr/local/lib/python3.10/dist-packages/_distutils_hack/__init__.py", line 72, in find_spec
4.225       return self.get_distutils_spec()
4.225     File "/usr/local/lib/python3.10/dist-packages/_distutils_hack/__init__.py", line 77, in get_distutils_spec
4.225       class DistutilsLoader(importlib.util.abc.Loader):
4.225   AttributeError: module 'importlib.util' has no attribute 'abc'

Solution

  • Errors like this usually come from package versions that are out of step with each other. For instance, a package removes a method or moves some functions around in its latest release, and another package that depends on it has not caught up with those changes yet.

    As in: Package A uses Package B's .do_something method, but Package B's developers rename it to .do_something_better. If you have the latest version of B but an old version of A that doesn't know about the rename, it will crash (as you've seen).

    That seems to happen quite a bit with Python 3.10 and setuptools. In your traceback, the pinned setuptools 49.6.0 (through its _distutils_hack module) expects importlib.util.abc, which the Python 3.10 in the image does not provide.

    TL;DR: you're seeing a quite common (and pesky) versioning issue.
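The rename scenario above can be reproduced in a few lines. This is only a sketch with made-up names (PackageB, package_a_entry_point, do_something and do_something_better are hypothetical stand-ins, not real libraries); it shows why the failure surfaces as an AttributeError far away from your own code.

```python
class PackageB:
    """Hypothetical newer release of B: do_something was renamed."""

    def do_something_better(self):
        return "done"


def package_a_entry_point(b):
    """Hypothetical older release of A, written against B's old API."""
    # A still calls the old name, which no longer exists on B.
    return b.do_something()


def run_demo():
    """Call A against the new B and report what happens."""
    try:
        package_a_entry_point(PackageB())
    except AttributeError as exc:
        # Same kind of error as in the pip traceback: the caller expects
        # an attribute that the installed version does not provide.
        return "AttributeError: {}".format(exc)
    return "no error"


if __name__ == "__main__":
    print(run_demo())
```

Upgrading A (or downgrading B) until both agree on the name is the fix, which is what loosening the setuptools pin amounts to here.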

    That said, this Dockerfile builds successfully:

    FROM ubuntu:latest
    
    ENV DEBIAN_FRONTEND=noninteractive
    
    RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
        r-base r-base-core r-cran-randomforest \
        libinput-dev libgbm-dev liblzma-dev libbz2-dev libicu-dev libblas-dev liblapack-dev \
        python3.6 python3-pip python3-setuptools python3-dev && \
       rm -r /var/lib/apt/lists/*
    
    WORKDIR /app
    
    COPY requirements.txt /app/requirements.txt
    
    # Quote the specifier: unquoted, the shell treats ">51" as a redirect to a file named 51
    RUN pip3 install --upgrade pip wheel "setuptools>51"
    
    RUN python3 -m pip install rpy2
    
    RUN Rscript -e "install.packages('data.table')"
    
    COPY . /app
    

    Notice that a number of -dev packages are required, and that I loosened the version constraints on pip, wheel and setuptools. Also, since I don't have your requirements.txt file, I had to leave it blank.
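One detail worth checking in any RUN line that carries a version specifier such as setuptools>51: in a shell, an unquoted > starts an output redirect, so the command is read as "install setuptools and write the output to a file called 51". The sketch below shows the effect, with echo standing in for pip (an assumption, so that it runs without network access). It needs a POSIX shell, and files_created_by is a helper name invented for this example.

```python
import os
import subprocess
import tempfile


def files_created_by(command):
    """Run a shell command in an empty temp directory and list the files it leaves behind."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(
            command,
            shell=True,
            cwd=workdir,
            check=True,
            stdout=subprocess.DEVNULL,  # keep the demo quiet
        )
        return sorted(os.listdir(workdir))


if __name__ == "__main__":
    # Unquoted: the shell redirects echo's output into a file named "51".
    print(files_created_by("echo setuptools>51"))
    # Quoted: the whole specifier is passed through as one argument.
    print(files_created_by('echo "setuptools>51"'))
```

With the quotes in place, pip receives the full specifier and can enforce the lower bound.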

    HOWEVER: You are pulling the ubuntu:latest image. As of October 2023 that means Ubuntu 22.04 (codename "Jammy Jellyfish"), whose default Python 3 is 3.10 (your traceback shows pip running under /usr/lib/python3.10), yet you seem to be installing Python 3.6. This can cause trouble: if you run apt-get install some_python_package, you could end up with Python 3.6 on your system alongside a version of some_python_package intended for Python 3.10, which is not great.
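To see which interpreter and packaging tools an image really ends up with, a small report script can help. This is a sketch, not part of the original setup: describe_environment is a name made up here, and the package versions are only looked up when importlib.metadata is available (Python 3.8 and later); on older interpreters they are reported as None.

```python
import sys

try:
    from importlib import metadata  # in the standard library from Python 3.8
except ImportError:
    metadata = None  # e.g. Python 3.6: package versions are reported as None


def describe_environment(packages=("pip", "setuptools", "wheel")):
    """Report the running interpreter and the installed versions of a few packages."""
    report = {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "executable": sys.executable,
    }
    for name in packages:
        version = None
        if metadata is not None:
            try:
                version = metadata.version(name)
            except metadata.PackageNotFoundError:
                version = None  # not installed for this interpreter
        report[name] = version
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("{}: {}".format(key, value))
```

Running it with python3 inside the container shows whether python3 resolves to 3.10 or 3.6 and which setuptools that interpreter sees.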

    If you'd rather use Python 3.6, may I suggest you base your Dockerfile on one of the Python Docker images?

    For instance, python:3.6.14-bullseye, which is Debian-based (not Ubuntu-based) but contains some tweaks and environment variables geared towards providing a safe environment (or "ecosystem") for Python 3.6:

    FROM python:3.6.14-bullseye
    
    ENV DEBIAN_FRONTEND=noninteractive
    
    RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
        r-base r-base-core r-cran-randomforest \
        libinput-dev libgbm-dev liblzma-dev libbz2-dev libicu-dev libblas-dev liblapack-dev \
       && rm -r /var/lib/apt/lists/*
    
    WORKDIR /app
    
    COPY requirements.txt /app/requirements.txt
    
    RUN pip3 install --upgrade pip wheel setuptools
    
    RUN python3 -m pip install rpy2
    
    RUN Rscript -e "install.packages('data.table')"
    
    COPY . /app
    

    There are quite a few more Python Docker images with slightly different features and contents. You might want to take a look at this article and see which one best fits your needs.

    Pinning an image to a specific version rather than to :latest also has the advantage that if (for instance) the Ubuntu Docker image maintainers decide to update what "latest" means from the current 22.04 to, let's say, 24.04, you won't be bitten by an unexpected full OS upgrade.
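If you want to catch unpinned base images before they bite, a rough check could look like the sketch below. The function name unpinned_base_images is invented for this example, and the parsing is deliberately simple: it does not handle ARG substitution, and it will also flag "FROM scratch" and references to earlier build stages.

```python
def unpinned_base_images(dockerfile_text):
    """Return base images from FROM lines that have no tag or use :latest."""
    unpinned = []
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Skip flags such as --platform=linux/amd64.
        args = [part for part in parts[1:] if not part.startswith("--")]
        if not args:
            continue
        image = args[0]
        if "@" in image:
            continue  # pinned by digest
        # Only look for a tag in the last path component, so a registry
        # port such as localhost:5000/ubuntu is not mistaken for a tag.
        last_component = image.rsplit("/", 1)[-1]
        tag = last_component.split(":", 1)[1] if ":" in last_component else None
        if tag is None or tag == "latest":
            unpinned.append(image)
    return unpinned


if __name__ == "__main__":
    print(unpinned_base_images("FROM ubuntu:latest\nRUN true"))
    print(unpinned_base_images("FROM python:3.6.14-bullseye"))
```

An empty list means every FROM line names an explicit tag or digest.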