Tags: apache-spark, pyspark, hadoop-yarn, rdd, google-cloud-dataproc

Could not find valid SPARK_HOME on Dataproc


A Spark job executed on a Dataproc cluster on Google Cloud gets stuck on the task PythonRDD.scala:446.

The error log says Could not find valid SPARK_HOME while searching ... paths under /hadoop/yarn/nm-local-dir/usercache/root/

The thing is, SPARK_HOME should be set by default on a Dataproc cluster. Other Spark jobs that don't use RDDs work just fine.

During cluster initialization I do not reinstall Spark (although I had tried that before, which I initially thought was causing the issue).

I also found out that all my executors were removed after a minute of running the task.

And yes, I have tried running the following initialization action, and it didn't help:

#!/bin/bash

# Append SPARK_HOME to the system-wide profile and to every bashrc under /etc.
cat << EOF | tee -a /etc/profile.d/custom_env.sh /etc/*bashrc >/dev/null
export SPARK_HOME=/usr/lib/spark/
EOF
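
(As a side note, /etc/profile.d scripts are only sourced by login shells, so variables set there are not necessarily visible inside YARN containers. A possible alternative, sketched below and not verified on Dataproc, is to pass the variable through Spark's own configuration: spark.executorEnv.* and spark.yarn.appMasterEnv.* are standard Spark-on-YARN properties, and /usr/lib/spark is Dataproc's default Spark location.)

from pyspark.sql import SparkSession

# Sketch: set SPARK_HOME for the YARN application master and the executors
# through Spark configuration rather than shell profile scripts.
spark = (
    SparkSession.builder
    .appName("spark-home-env-sketch")
    .config("spark.yarn.appMasterEnv.SPARK_HOME", "/usr/lib/spark")
    .config("spark.executorEnv.SPARK_HOME", "/usr/lib/spark")
    .getOrCreate()
)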

Any help?


Solution

  • I was using a custom mapping function. When I moved the function to a separate file, the problem disappeared (a minimal sketch follows below).
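
For illustration, a minimal sketch of that fix, assuming a PySpark job; the module name mapper.py and the function clean_record are made up for the example. Because the mapping function lives in a regular module rather than in the driver script, it is pickled by reference and the shipped module is imported on each executor, instead of the function being serialized out of the driver's __main__.

# mapper.py -- hypothetical separate module holding the mapping function
def clean_record(record):
    # Plain function with no references to driver-side state.
    return record.strip().lower()

# job.py -- the driver script
from pyspark import SparkContext

import mapper  # the mapping function now lives outside the driver script

sc = SparkContext.getOrCreate()
# Ship the module so each executor's Python worker can import it
# (spark-submit --py-files mapper.py achieves the same thing).
sc.addPyFile("mapper.py")

rdd = sc.parallelize([" Foo ", " BAR "])
print(rdd.map(mapper.clean_record).collect())   # ['foo', 'bar']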