Spark job executed by Dataproc cluster on Google Cloud gets stuck on a task PythonRDD.scala:446
The error log says Could not find valid SPARK_HOME while searching
... paths under /hadoop/yarn/nm-local-dir/usercache/root/
The thing is, SPARK_HOME should be set by default on a dataproc cluster. Other spark jobs that don't use RDDs work just fine.
During the initialization of the cluster I do not reinstall spark (but I have tried to, which I previously thought caused the issue).
I also found out that all my executors were removed after a minute of running the task.
And yes, I have tried to run the following initialization action and it didn't help:
#!/bin/bash
cat << EOF | tee -a /etc/profile.d/custom_env.sh /etc/*bashrc >/dev/null
export SPARK_HOME=/usr/lib/spark/
EOF
Any help?
I was using a custom mapping function. When I put the function to a separate file the problem disappeared.