Tags: python, pyspark, pipenv

Cannot import pyspark from a pipenv virtualenv as it cannot find py4j


I have built a docker image containing spark and pipenv. If I run python within the pipenv virtualenv and attempt to import pyspark, it fails with the error "ModuleNotFoundError: No module named 'py4j'":

root@4d0ae585a52a:/tmp# pipenv run python -c "from pyspark.sql import SparkSession"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/spark/python/pyspark/__init__.py", line 46, in <module>
    from pyspark.context import SparkContext
  File "/opt/spark/python/pyspark/context.py", line 29, in <module>
    from py4j.protocol import Py4JError
ModuleNotFoundError: No module named 'py4j'

However, if I run pyspark within that same virtualenv there are no such problems:

root@4d0ae585a52a:/tmp# pipenv run pyspark
Python 3.7.4 (default, Sep 12 2019, 16:02:06) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/10/16 10:18:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/10/16 10:18:33 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/

Using Python version 3.7.4 (default, Sep 12 2019 16:02:06)
SparkSession available as 'spark'.
>>> spark.createDataFrame([('Alice',)], ['name']).collect()
[Row(name='Alice')]

I admit I copied a lot of the code for my Dockerfile from elsewhere, so I'm not fully au fait with how it all hangs together under the covers. I was hoping that having py4j on the PYTHONPATH would be enough, but apparently not. I can confirm that it is on the PYTHONPATH and that it exists:

root@4d0ae585a52a:/tmp# pipenv run python -c "import os;print(os.environ['PYTHONPATH'])"
/opt/spark/python:/opt/spark/python/lib/py4j-0.10.7-src.zip:
root@4d0ae585a52a:/tmp# pipenv run ls /opt/spark/python/lib/py4j*
/opt/spark/python/lib/py4j-0.10.4-src.zip

Can anyone suggest what I can do to make py4j available to my python interpreter in my virtualenv?
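
In case it's useful for diagnosing this, a quick way to see what the interpreter itself has on sys.path, and whether each entry actually exists on disk, would be something like the following (just a diagnostic sketch, not part of the image):

import os
import sys

# Print every sys.path entry (PYTHONPATH entries, including zip archives,
# should appear here) and whether it actually exists on disk.
for entry in sys.path:
    if not entry:
        continue  # an empty entry means the current directory
    print("OK     " if os.path.exists(entry) else "MISSING", entry)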


Here is the Dockerfile. We pull artifacts (docker images, apt packages, pypi packages, etc.) from our local JFrog Artifactory cache, hence all the Artifactory references herein:

FROM images.artifactory.our.org.com/python3-7-pipenv:1.0

WORKDIR /tmp

ENV SPARK_VERSION=2.2.1
ENV HADOOP_VERSION=2.8.4

ARG ARTIFACTORY_USER
ARG ARTIFACTORY_ENCRYPTED_PASSWORD
ARG ARTIFACTORY_PATH=artifactory.our.org.com/artifactory/generic-dev/ceng/external-dependencies
ARG SPARK_BINARY_PATH=https://${ARTIFACTORY_PATH}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz
ARG HADOOP_BINARY_PATH=https://${ARTIFACTORY_PATH}/hadoop-${HADOOP_VERSION}.tar.gz


ADD apt-transport-https_1.4.8_amd64.deb /tmp

RUN echo "deb https://username:[email protected]/artifactory/debian-main-remote stretch main" >/etc/apt/sources.list.d/main.list &&\
    echo "deb https://username:[email protected]/artifactory/maria-db-debian stretch main" >>/etc/apt/sources.list.d/main.list &&\
    echo 'Acquire::CompressionTypes::Order:: "gz";' > /etc/apt/apt.conf.d/02update &&\
    echo 'Acquire::http::Timeout "10";' > /etc/apt/apt.conf.d/99timeout &&\
    echo 'Acquire::ftp::Timeout "10";' >> /etc/apt/apt.conf.d/99timeout &&\
    dpkg -i /tmp/apt-transport-https_1.4.8_amd64.deb &&\
    apt-get install --allow-unauthenticated -y /tmp/apt-transport-https_1.4.8_amd64.deb &&\
    apt-get update --allow-unauthenticated -y -o Dir::Etc::sourcelist="sources.list.d/main.list" -o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0"


RUN apt-get update && \
    apt-get -y install default-jdk

# Detect JAVA_HOME and export in bashrc.
# This will result in something like this being added to /etc/bash.bashrc
#   export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN echo export JAVA_HOME="$(readlink -f /usr/bin/java | sed "s:/jre/bin/java::")" >> /etc/bash.bashrc

# Configure Spark-${SPARK_VERSION}
# Not using tar -v because verbose output causes CI logs to exceed the max length
RUN curl --fail -u "${ARTIFACTORY_USER}:${ARTIFACTORY_ENCRYPTED_PASSWORD}" -X GET "${SPARK_BINARY_PATH}" -o /opt/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
    && cd /opt \
    && tar -xzf /opt/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
    && rm spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
    && ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 spark \
    && sed -i '/log4j.rootCategory=INFO, console/c\log4j.rootCategory=CRITICAL, console' /opt/spark/conf/log4j.properties.template \
    && mv /opt/spark/conf/log4j.properties.template /opt/spark/conf/log4j.properties \
    && mkdir /opt/spark-optional-jars/ \
    && mv /opt/spark/conf/spark-defaults.conf.template /opt/spark/conf/spark-defaults.conf \
    && printf "spark.driver.extraClassPath /opt/spark-optional-jars/*\nspark.executor.extraClassPath /opt/spark-optional-jars/*\n">>/opt/spark/conf/spark-defaults.conf \
    && printf "spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby" >> /opt/spark/conf/spark-defaults.conf

# Configure Hadoop-${HADOOP_VERSION}
# Not using tar -v because verbose output causes CI logs to exceed the max length
RUN curl --fail -u "${ARTIFACTORY_USER}:${ARTIFACTORY_ENCRYPTED_PASSWORD}" -X GET "${HADOOP_BINARY_PATH}" -o /opt/hadoop-${HADOOP_VERSION}.tar.gz \
    && cd /opt \
    && tar -xzf /opt/hadoop-${HADOOP_VERSION}.tar.gz \
    && rm /opt/hadoop-${HADOOP_VERSION}.tar.gz \
    && ln -s hadoop-${HADOOP_VERSION} hadoop

# Set Environment Variables.
ENV SPARK_HOME="/opt/spark" \
    HADOOP_HOME="/opt/hadoop" \
    PYSPARK_SUBMIT_ARGS="--master=local[*] pyspark-shell --executor-memory 1g --driver-memory 1g --conf spark.ui.enabled=false spark.executor.extrajavaoptions=-Xmx=1024m" \
    PYTHONPATH="/opt/spark/python:/opt/spark/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH" \
    PATH="$PATH:/opt/spark/bin:/opt/hadoop/bin" \
    PYSPARK_DRIVER_PYTHON="/usr/local/bin/python" \
    PYSPARK_PYTHON="/usr/local/bin/python"

# Upgrade pip and setuptools
RUN pip install --index-url https://username:password@artifactory.our.org.com/artifactory/api/pypi/pypi-virtual-all/simple --upgrade pip setuptools

Solution

  • I think I've managed to get around this simply by installing py4j standalone:

    $ docker run --rm -it images.artifactory.our.org.com/myimage:mytag bash
    root@1d6a0ec725f0:/tmp# pipenv install py4j
    Installing py4j…
    ✔ Installation Succeeded 
    Pipfile.lock (49f1d8) out of date, updating to (dfdbd6)…
    Locking [dev-packages] dependencies…
    Locking [packages] dependencies…
    ✔ Success! 
    Updated Pipfile.lock (49f1d8)!
    Installing dependencies from Pipfile.lock (49f1d8)…
      🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 42/42 — 00:00:06
    To activate this project's virtualenv, run pipenv shell.
    Alternatively, run a command inside the virtualenv with pipenv run.
    root@1d6a0ec725f0:/tmp# pipenv run python -c "from pyspark.sql import SparkSession;spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate();print(spark.createDataFrame([('Alice',)], ['name']).collect())"
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    19/10/16 13:05:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    19/10/16 13:05:48 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
    [Row(name='Alice')]
    root@1d6a0ec725f0:/tmp#
    

    Not entirely sure why I have to, given that py4j is already on the PYTHONPATH, but so far it seems OK, so I'm happy. If anyone can elucidate why it didn't work without explicitly installing py4j, I'd love to know. I can only assume that this line from my Dockerfile:

    PYTHONPATH="/opt/spark/python:/opt/spark/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH"

    does not successfully make py4j known to the interpreter.
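
    One detail that may be relevant: the PYTHONPATH above names py4j-0.10.7-src.zip, whereas the file that actually exists under /opt/spark/python/lib/ is py4j-0.10.4-src.zip, so that zip entry may simply never have pointed at a real file. A quick way to check both candidates, and to see where (if anywhere) the interpreter would load py4j from, is something like this (a minimal diagnostic sketch, not part of the image):

    import importlib.util
    import os

    # The archive named on PYTHONPATH in the Dockerfile vs. the file on disk.
    for candidate in (
        "/opt/spark/python/lib/py4j-0.10.7-src.zip",
        "/opt/spark/python/lib/py4j-0.10.4-src.zip",
    ):
        print(candidate, "exists" if os.path.exists(candidate) else "MISSING")

    # Where the interpreter would currently resolve py4j from, if at all.
    spec = importlib.util.find_spec("py4j")
    print("py4j resolves to:", spec.origin if spec else None)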


    Just to confirm (in case it helps), here is where pip thinks py4j & pyspark are installed:

    root@1d6a0ec725f0:/tmp# pipenv run pip show pyspark
    Name: pyspark
    Version: 2.2.1
    Summary: Apache Spark Python API
    Home-page: https://github.com/apache/spark/tree/master/python
    Author: Spark Developers
    Author-email: dev@spark.apache.org
    License: http://www.apache.org/licenses/LICENSE-2.0
    Location: /opt/spark-2.2.1-bin-hadoop2.7/python
    Requires: py4j
    Required-by: 
    root@1d6a0ec725f0:/tmp# pipenv run pip show py4j
    Name: py4j
    Version: 0.10.8.1
    Summary: Enables Python programs to dynamically access arbitrary Java objects
    Home-page: https://www.py4j.org/
    Author: Barthelemy Dagenais
    Author-email: barthelemy@infobart.com
    License: BSD License
    Location: /root/.local/share/virtualenvs/tmp-XVr6zr33/lib/python3.7/site-packages
    Requires: 
    Required-by: pyspark
    root@1d6a0ec725f0:/tmp#
    

    Another solution: unzip the py4j zip file as part of the Dockerfile stage that installs spark, and then set PYTHONPATH accordingly:

    unzip spark/python/lib/py4j-*-src.zip -d spark/python/lib/
    ...
    ...
    PYTHONPATH="/opt/spark/python:/opt/spark/python/lib:$PYTHONPATH"
    

    This feels like the best solution actually. Here's the new Dockerfile:

    FROM images.artifactory.our.org.com/python3-7-pipenv:1.0
    
    WORKDIR /tmp
    
    ENV SPARK_VERSION=2.2.1
    ENV HADOOP_VERSION=2.8.4
    
    ARG ARTIFACTORY_USER
    ARG ARTIFACTORY_ENCRYPTED_PASSWORD
    ARG ARTIFACTORY_PATH=artifactory.our.org.com/artifactory/generic-dev/ceng/external-dependencies
    ARG SPARK_BINARY_PATH=https://${ARTIFACTORY_PATH}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz
    ARG HADOOP_BINARY_PATH=https://${ARTIFACTORY_PATH}/hadoop-${HADOOP_VERSION}.tar.gz
    
    
    ADD apt-transport-https_1.4.8_amd64.deb /tmp
    
    RUN echo "deb https://username:[email protected]/artifactory/debian-main-remote stretch main" >/etc/apt/sources.list.d/main.list &&\
        echo "deb https://username:[email protected]/artifactory/maria-db-debian stretch main" >>/etc/apt/sources.list.d/main.list &&\
        echo 'Acquire::CompressionTypes::Order:: "gz";' > /etc/apt/apt.conf.d/02update &&\
        echo 'Acquire::http::Timeout "10";' > /etc/apt/apt.conf.d/99timeout &&\
        echo 'Acquire::ftp::Timeout "10";' >> /etc/apt/apt.conf.d/99timeout &&\
        dpkg -i /tmp/apt-transport-https_1.4.8_amd64.deb &&\
        apt-get install --allow-unauthenticated -y /tmp/apt-transport-https_1.4.8_amd64.deb &&\
        apt-get update --allow-unauthenticated -y -o Dir::Etc::sourcelist="sources.list.d/main.list" -o Dir::Etc::sourceparts="-" -o APT::Get::List-Cleanup="0"
    
    
    RUN apt-get update && \
        apt-get -y install default-jdk
    
    # Detect JAVA_HOME and export in bashrc.
    # This will result in something like this being added to /etc/bash.bashrc
    #   export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    RUN echo export JAVA_HOME="$(readlink -f /usr/bin/java | sed "s:/jre/bin/java::")" >> /etc/bash.bashrc
    
    # Configure Spark-${SPARK_VERSION}
    # Not using tar -v because verbose output causes CI logs to exceed the max length
    RUN curl --fail -u "${ARTIFACTORY_USER}:${ARTIFACTORY_ENCRYPTED_PASSWORD}" -X GET "${SPARK_BINARY_PATH}" -o /opt/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
        && cd /opt \
        && tar -xzf /opt/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
        && rm spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
        && ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 spark \
        && unzip spark/python/lib/py4j-*-src.zip -d spark/python/lib/ \
        && sed -i '/log4j.rootCategory=INFO, console/c\log4j.rootCategory=CRITICAL, console' /opt/spark/conf/log4j.properties.template \
        && mv /opt/spark/conf/log4j.properties.template /opt/spark/conf/log4j.properties \
        && mkdir /opt/spark-optional-jars/ \
        && mv /opt/spark/conf/spark-defaults.conf.template /opt/spark/conf/spark-defaults.conf \
        && printf "spark.driver.extraClassPath /opt/spark-optional-jars/*\nspark.executor.extraClassPath /opt/spark-optional-jars/*\n">>/opt/spark/conf/spark-defaults.conf \
        && printf "spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby" >> /opt/spark/conf/spark-defaults.conf
    
    # Configure Hadoop-${HADOOP_VERSION}
    # Not using tar -v because verbose output causes CI logs to exceed the max length
    RUN curl --fail -u "${ARTIFACTORY_USER}:${ARTIFACTORY_ENCRYPTED_PASSWORD}" -X GET "${HADOOP_BINARY_PATH}" -o /opt/hadoop-${HADOOP_VERSION}.tar.gz \
        && cd /opt \
        && tar -xzf /opt/hadoop-${HADOOP_VERSION}.tar.gz \
        && rm /opt/hadoop-${HADOOP_VERSION}.tar.gz \
        && ln -s hadoop-${HADOOP_VERSION} hadoop
    
    # Set Environment Variables.
    ENV SPARK_HOME="/opt/spark" \
        HADOOP_HOME="/opt/hadoop" \
        PYSPARK_SUBMIT_ARGS="--master=local[*] pyspark-shell --executor-memory 1g --driver-memory 1g --conf spark.ui.enabled=false spark.executor.extrajavaoptions=-Xmx=1024m" \
        PYTHONPATH="/opt/spark/python:/opt/spark/python/lib:$PYTHONPATH" \
        PATH="$PATH:/opt/spark/bin:/opt/hadoop/bin" \
        PYSPARK_DRIVER_PYTHON="/usr/local/bin/python" \
        PYSPARK_PYTHON="/usr/local/bin/python"
    
    # Upgrade pip and setuptools
    RUN pip install --index-url https://username:password@artifactory.our.org.com/artifactory/api/pypi/pypi-virtual-all/simple --upgrade pip setuptools
    

    So apparently I cannot put a zip file on the PYTHONPATH and have the contents of that zip file available to the python interpreter. As I said above, I copied that code from elsewhere, so why it worked for someone else and not for me I have no idea. Oh well, everything is working now.


    Here is a nice single command to check that it all works:

    docker run --rm -it myimage:mytag pipenv run python -c "from pyspark.sql import SparkSession;spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate();print(spark.createDataFrame([('Alice',)], ['name']).collect())"
    

    Here's my output from running that command:

    $ docker run --rm -it myimage:mytag pipenv run python -c "from pyspark.sql import SparkSession;spark = SparkSession.builder.master('local').enableHiveSupport().getOrCreate();print(spark.createDataFrame([('Alice',)], ['name']).collect())"
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    19/10/16 15:53:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    19/10/16 15:53:55 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
    19/10/16 15:53:55 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
    19/10/16 15:53:56 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
    [Row(name='Alice')]