Tags: apache-spark, hadoop-yarn, apache-zeppelin

Spark (YARN) applications started by Zeppelin in YARN cluster mode aren't killed after Zeppelin is stopped


I'm running Zeppelin 0.8.1, configured to submit Spark jobs to a YARN 2.7.5 cluster, with interpreters in both cluster mode (i.e. the Application Master runs on YARN rather than on the driver host) and client mode.

The YARN applications started in client mode are killed immediately after I stop the Zeppelin server. The jobs started in cluster mode, however, turn into zombies and hog all of the resources in the YARN cluster (dynamic resource allocation is not enabled).

Is there a way to make Zeppelin kill those jobs on exit, or any other way to solve this problem?


Solution

  • Starting from version 0.8, Zeppelin provides a setting to shut down idle interpreters: zeppelin.interpreter.lifecyclemanager.timeout.threshold.

    See Interpreter Lifecycle Management
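
    In zeppelin-site.xml this looks roughly like the sketch below. The property names and defaults are taken from the Interpreter Lifecycle Management docs (TimeoutLifecycleManager, a 60 s check interval, a 1 h idle threshold); verify them against your Zeppelin version:

    <property>
      <name>zeppelin.interpreter.lifecyclemanager.class</name>
      <value>org.apache.zeppelin.interpreter.lifecycle.TimeoutLifecycleManager</value>
    </property>
    <property>
      <name>zeppelin.interpreter.lifecyclemanager.timeout.checkinterval</name>
      <value>60000</value>    <!-- how often to check for idle interpreters, in ms -->
    </property>
    <property>
      <name>zeppelin.interpreter.lifecyclemanager.timeout.threshold</name>
      <value>3600000</value>  <!-- shut down interpreters idle for 1 hour -->
    </property>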

    Before this, I used a simple shell script that lists the Zeppelin Spark interpreter applications running on YARN and kills any that have been running for more than one hour:

    #!/bin/bash
    # Kill Zeppelin Spark interpreter applications that have been running
    # on YARN for longer than max_life_in_mins.
    max_life_in_mins=60
    
    zeppelinApps=$(yarn application -list 2>/dev/null | grep "RUNNING" | grep "Zeppelin Spark Interpreter" | awk '{print $1}')
    
    for jobId in $zeppelinApps
    do
        # A non-zero Finish-Time means the application already finished; skip it.
        finish_time=$(yarn application -status "$jobId" 2>/dev/null | grep "Finish-Time" | awk '{print $NF}')
        if [ "$finish_time" -ne 0 ]; then
          echo "App $jobId is not running"
          continue
        fi
    
        # Start-Time is reported in milliseconds; convert to seconds and
        # compute how long the application has been running, in minutes.
        start_time=$(yarn application -status "$jobId" 2>/dev/null | grep "Start-Time" | awk '{print $NF}')
        time_diff_in_mins=$(( ( $(date +%s) - start_time / 1000 ) / 60 ))
    
        if [ "$time_diff_in_mins" -gt "$max_life_in_mins" ]; then
          echo "Killing app $jobId"
          yarn application -kill "$jobId"
        fi
    done
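
    To run this periodically, a crontab entry along these lines (the script path and log path are illustrative) could check every 10 minutes:

    */10 * * * * /usr/local/bin/kill-idle-zeppelin-apps.sh >> /var/log/zeppelin-yarn-cleanup.log 2>&1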
    

    There is also the YARN ResourceManager REST API to do the same thing.
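
    For example, a minimal sketch using the ResourceManager's Cluster Application State API (the ResourceManager host and application ID below are placeholders; 8088 is the default RM web port):

    # Ask the ResourceManager to move the application to the KILLED state.
    curl -X PUT -H "Content-Type: application/json" \
         -d '{"state": "KILLED"}' \
         "http://<rm-host>:8088/ws/v1/cluster/apps/<application_id>/state"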