
Using LD_LIBRARY_PATH in Cloud Dataproc PySpark


I've set up a highly customized virtual environment on Cloud Dataproc. Some of the libraries in this virtual environment depend on certain shared libraries, which are packaged along with the virtual environment.

For the virtual environment, I made PYSPARK_PYTHON point to the Python interpreter inside the virtual environment.
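For example, via an entry along these lines in /etc/spark/conf/spark-env.sh (only a sketch with a placeholder path; other mechanisms, such as the spark.pyspark.python property, would also work):

    # Point PySpark at the interpreter inside the virtual environment
    # (placeholder path; adjust to wherever the venv lives on each node)
    export PYSPARK_PYTHON=/path/to/venv/bin/python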

However, these libraries do not work because LD_LIBRARY_PATH is not set when I run gcloud dataproc jobs submit ....

I've tried:

  1. Setting spark-env.sh on the workers and master to export LD_LIBRARY_PATH
  2. Setting spark.executorEnv.LD_LIBRARY_PATH
  3. Creating an initialization script that applies (1) during cluster initialization

However, all of these attempts failed.
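For reference, attempts (1) and (3) amounted to adding something along these lines to /etc/spark/conf/spark-env.sh on the master and workers, first by hand and later from the initialization script (the library path below is a placeholder for the venv's shared-library directory):

    # Prepend the venv's shared libraries so the Python extensions can find them
    # (placeholder path)
    export LD_LIBRARY_PATH=/path/to/venv/lib:${LD_LIBRARY_PATH}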


Solution

  • This is what finally worked:

    Running the gcloud command as:

    gcloud dataproc jobs submit pyspark --cluster spark-tests spark_job.py --properties spark.executorEnv.LD_LIBRARY_PATH="path1:path2" 
    

    When I tried to set spark.executorEnv inside the PySpark script (using the Spark config object), it didn't work, though. I'm not sure why that is.