Tags: hadoop, apache-spark, hdfs, hadoop-yarn

spark-submit on edge node


I am submitting my spark-submit command through my edge node, using client mode. I access the edge node (which is on the same network as my cluster) from my laptop. I know that the driver program runs on the edge node; what I want to know is why my Spark job automatically stops when I close my SSH session with the edge node. Does opening the PuTTY connection to the edge node over VPN/wireless internet have any effect on the Spark job, compared to using an Ethernet cable from within the network? At present the spark-submit job is very slow even though the cluster is really powerful. Please help!

Thanks!


Solution

  • You are submitting the job with --master yarn but probably without --deploy-mode cluster, so the driver application (your Java code) runs locally on the edge node machine. With --deploy-mode cluster the driver runs inside the cluster instead, which is more robust overall.

    The Spark job dies when you close the SSH connection because the driver is attached to your terminal session, so closing the session kills it. To avoid this, run the command as a background job by putting & at the end of your spark-submit (depending on your shell, you may also need nohup or disown so the background job does not receive the hangup signal when the connection drops). For example:

    spark-submit --master yarn --class foo bar zaz &

    This sends the driver into the background; its stdout is still written to your tty, cluttering your session, but the process is no longer killed when you close the SSH connection. If you don't want the clutter, you can redirect stdout to /dev/null like this:

    spark-submit --master yarn --class foo bar zaz &>/dev/null &

    The downside is that you won't know why things failed. You can instead redirect stdout to a log file rather than /dev/null, so the driver output is preserved.
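
    For example, a minimal sketch keeping the placeholder class and jar names (foo, bar, zaz) from above, with nohup added for extra safety against the hangup signal and a hypothetical log file name, driver.log:

    # run the driver in the background, immune to hangup, and capture all its output in driver.log
    nohup spark-submit --master yarn --class foo bar zaz > driver.log 2>&1 &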

    Finally, once this is clear, I strongly recommend not deploying your Spark jobs this way: if the driver process on the edge node fails for any reason, it kills the job running in the cluster. The reverse is also awkward: if the job dies in the cluster (say, from a runtime problem), the driver on the edge node is neither stopped nor killed, which wastes a lot of memory on that machine unless you take care to kill those stale driver processes manually. All of this is avoided by using the flag --deploy-mode cluster in your spark-submit, as in the sketch below.
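
    A minimal cluster-mode sketch, again using the placeholder class and jar names (foo, bar, zaz) from the examples above:

    # the driver runs inside the cluster, so it no longer depends on the edge node or your SSH session
    spark-submit --master yarn --deploy-mode cluster --class foo bar zaz

    In cluster mode the driver's output lives with the YARN application rather than in your terminal, so (assuming log aggregation is enabled) you can retrieve it afterwards with yarn logs -applicationId <application id>.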