Tags: docker, apache-spark, kubernetes, pyspark, crashloopbackoff

Why does my Spark history server pod start, complete immediately, and end up in CrashLoopBackOff?


Some context to start with: a Spark app is running in my Kubernetes cluster, and I want to add a deployment that starts the Spark history server, which will read the event logs that app writes to a shared volume.

For security reasons in this project, I can't use the Spark operator image directly in my Dockerfile, so I install Spark through a conda env and pyspark. I also set the SPARK_HISTORY_OPTS env var (with ENV in the Dockerfile) instead of using the config file, since the two should be equivalent.

SPARK_HISTORY_OPTS='-Dspark.history.fs.logDirectory=/execution-events -Dspark.eventLog.dir=/execution-events -Dspark.eventLog.enabled=true -Dspark.history.fs.cleaner.enabled=true -Dspark.history.ui.port=4039'

The shared volume is mounted on the deployment at the same path, /execution-events.

My custom entrypoint.sh has just a few steps:

- export SPARK_HOME
- start the Spark history server with a simple: exec /usr/bin/tini -s -- $SPARK_HOME/sbin/start-history-server.sh

When I watch the deployment being created, the pod starts the server, but then it ends in the Completed state and keeps restarting until it reaches CrashLoopBackOff, which I don't understand.

The Spark history server should stay alive until I run the stop-history-server.sh script, so why doesn't it stay alive?

Thanks in advance for any answers.

PS: When I add a sleep of around 5 minutes to debug, ssh into the pod, and start the server manually, I can see the message saying that the Spark history server is starting.

And I can see in the logs folder that the files are created.

This is the message in the pod's log:

+ exec /usr/bin/tini -s -- /opt/conda/envs/spark-env-3.1.2/lib/python3.7/site-packages/pyspark/sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/conda/envs/spark-env-3.1.2/lib/python3.7/site-packages/pyspark/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-spark-histor
Stream closed EOF for ***NAMESPACE***/spark-history-deployment-65dd4dd6f5-wk27t (spark-history-container)

Solution

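  • I found the problem recently. By default, start-history-server.sh launches the server through the daemon script, which puts the Java process in the background and then returns. The script run by the entrypoint therefore exits with status 0, the container is marked Completed, and Kubernetes keeps restarting it until the pod lands in CrashLoopBackOff. The fix is to set, in entrypoint.sh, an env var that the daemon script reads to run the server in the foreground instead, which keeps the pod alive.

    Add this before the call to start-history-server.sh:

    export SPARK_NO_DAEMONIZE=true

    The daemon script only checks whether the variable is set, not what it contains, so the export SPARK_NO_DAEMONIZE=false I originally used has the same effect; true is just less confusing to read.

    I hope this helps anyone who runs into the same problem.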
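
As a rough illustration of why setting this variable works: as far as I can tell, the daemon script (sbin/spark-daemon.sh) only tests whether SPARK_NO_DAEMONIZE is set at all, not what value it holds. The sketch below reproduces that kind of test in isolation. The function name would_daemonize is hypothetical and exists only for this example; it is not part of Spark.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): mirrors an "is the variable set?"
# test, which appears to be how spark-daemon.sh chooses between detaching
# the server and running it in the foreground.
would_daemonize() {
  # ${VAR+set} expands to "set" when VAR is set (even to an empty string)
  # and to nothing when VAR is unset.
  if [ -z "${SPARK_NO_DAEMONIZE+set}" ]; then
    echo "background"   # unset: process is detached and the script returns
  else
    echo "foreground"   # set to any value: process stays attached
  fi
}

unset SPARK_NO_DAEMONIZE
would_daemonize            # prints "background"

SPARK_NO_DAEMONIZE=false
would_daemonize            # prints "foreground"

SPARK_NO_DAEMONIZE=true
would_daemonize            # prints "foreground"
```

In the real entrypoint.sh, the export has to come before the exec line that calls start-history-server.sh, so that the variable is inherited by tini and by the Spark scripts it launches.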