I've set up a highly customized virtual environment on Cloud Dataproc. Some of the libraries in this virtual environment depend on certain shared libraries, which are packaged along with the virtual environment. For the virtual environment, I made PYSPARK_PYTHON point to the Python interpreter inside it.
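For context, one way to point PYSPARK_PYTHON at the venv interpreter is via a cluster property at creation time (the `spark-env:` prefix writes the entry into spark-env.sh); this is roughly what I did, with the venv path below being a placeholder for my actual one:

```
gcloud dataproc clusters create spark-tests \
    --properties "spark-env:PYSPARK_PYTHON=/opt/my-venv/bin/python"
```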
However, these libraries fail to load because LD_LIBRARY_PATH is not set when I run gcloud dataproc jobs submit ....
I've tried exporting LD_LIBRARY_PATH from spark-env.sh on the workers and the master (a sketch of that attempt is below), and setting the spark.executorEnv.LD_LIBRARY_PATH property. Both of these failed.
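The spark-env.sh attempt looked roughly like this, with "path1:path2" standing in for the actual shared-library directories inside the venv:

```
# Appended to /etc/spark/conf/spark-env.sh on the master and every worker:
export LD_LIBRARY_PATH="path1:path2:${LD_LIBRARY_PATH}"
```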
This is what finally worked: running the gcloud command as

```
gcloud dataproc jobs submit pyspark --cluster spark-tests spark_job.py \
    --properties spark.executorEnv.LD_LIBRARY_PATH="path1:path2"
```
When I tried to set spark.executorEnv.LD_LIBRARY_PATH inside the PySpark script (using the SparkConf object), though, it didn't work. I'm not sure why that is.
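For reference, the in-script attempt was roughly the following (again with "path1:path2" as a placeholder). My guess is that by the time the script sets the property, the executor environment has already been fixed by the submission machinery, but I haven't confirmed that:

```python
from pyspark import SparkConf, SparkContext

# What I tried inside spark_job.py: set the executor env var on the
# SparkConf before creating the SparkContext.
conf = SparkConf()
conf.set("spark.executorEnv.LD_LIBRARY_PATH", "path1:path2")
sc = SparkContext(conf=conf)
```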