Tags: apache-spark, pyspark, package, python-venv, spark-submit

spark-submit python packages with venv cannot run program


I was following this article to encapsulate the fuzzy-c-means lib so it can run on a Spark cluster; I'm using the bitnami/spark image on Docker. I used a Python image to build a venv with Python 3.7 and installed the fuzzy-c-means lib, then used venv-pack to compress the venv into an environment.tar.gz file.
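For reference, that build step would look roughly like this; the python:3.7 image tag is an assumption, and the exact commands may differ from what the article uses:

# Inside a python:3.7 container (image tag assumed), build and pack the venv
python3.7 -m venv environment
source environment/bin/activate

pip install --upgrade pip
pip install fuzzy-c-means venv-pack

# Pack the environment so spark-submit can ship it with --archives
venv-pack -o environment.tar.gz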

I have an app.py file:

from pyspark.sql import SparkSession


def main(spark):
    # Import inside the job to check that the packaged dependency resolves
    import fcmeans
    print('-')


if __name__ == "__main__":
    print('log')
    spark = (
        SparkSession.builder
        .getOrCreate()
    )
    main(spark)

So when I run my spark-submit command I get the error: Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory.

The spark-submit command:

PYSPARK_PYTHON=./environment/bin/python spark-submit --archives ./environment.tar.gz#environment ./app.py

I can run app.py with the .tar.gz archive if I remove the PYSPARK_PYTHON assignment, but then I get No module named 'fcmeans' for the import in my app.py.

The thing is, when --archives ./environment.tar.gz#environment runs, it unpacks the tar.gz into /tmp/spark-uuid-code/userFiles-uuid-code/environment/, and when I set PYSPARK_PYTHON it does not recognize that path as a valid file, even though it seems like Spark should manage this.
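For comparison, the Spark documentation on Python package management shows this same venv-pack workflow with the driver interpreter set explicitly as well; a minimal sketch of that documented invocation, reusing the file names from this question (behaviour can differ between local, client and cluster mode):

# Keep the host Python for the driver; point workers at the unpacked archive
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python

spark-submit --archives ./environment.tar.gz#environment ./app.py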

Any hints on what I should do?


Solution

  • I've managed to make it work by creating the virtualenv inside the EMR cluster and then exporting the .tar.gz file produced by venv-pack to an S3 bucket; a submit sketch that consumes the uploaded archive follows the shell commands below. This article helped: gist.github.

    Inside the EMR shell:

    # Create and activate our virtual environment
    virtualenv -p python3 venv-datapeeps
    source ./venv-datapeeps/bin/activate
    
    # Upgrade pip and install a couple libraries
    pip3 install --upgrade pip
    pip3 install fuzzy-c-means boto3 venv-pack
    
    # Package the environment and upload
    venv-pack -o pyspark_venv.tar.gz
    aws s3 cp pyspark_venv.tar.gz s3://<BUCKET>/artifacts/pyspark/
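
    To consume that archive from S3 when submitting app.py on EMR, a command along these lines is typical; the bucket path matches the upload above, but the config values are assumptions rather than part of the original answer:

    # Hypothetical EMR submit: ship the packed venv from S3 and point the
    # YARN application master and the executors at its Python interpreter.
    spark-submit \
      --deploy-mode cluster \
      --archives s3://<BUCKET>/artifacts/pyspark/pyspark_venv.tar.gz#environment \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
      --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
      ./app.py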