
Why is Spark filling the tmp directory (spark.local.dir) on the machine that submits jobs?


I have a Spark 1.2.1 cluster set up in standalone mode with a master and a few slaves. I then let my data scientists enjoy the cluster's power.

Everything is working fine. However, on the dedicated server that my data scientists use to submit Spark jobs, the spark.local.dir directory is gradually filling up.

Given that this machine sits outside the cluster and is neither a master nor a worker/slave, I would not expect Spark to use its local spark.local.dir in any way. (And why would it? It only shows the logs.)

I could not find any good documentation covering this. Does anybody have an idea?


Solution

  • There is not enough information about your setup to be sure, but I am guessing that the jobs are launched in client mode, in which case the driver runs on your client node.

    From the Spark docs: "In client mode, the driver is launched in the same process as the client that submits the application. In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish."

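    If so, the submitting machine is not just showing logs: it hosts the driver for every application launched from it. I am guessing that the driver writes its scratch data to spark.local.dir on that machine while it coordinates the workers, which would explain why the directory keeps growing there.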
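
To check this guess, it may help to see what is actually taking up space in the directory on the submitting machine. Below is a minimal sketch in plain Python (no Spark required) that lists the largest top-level entries of a directory. The function names `dir_size_bytes` and `largest_entries` are made up for this illustration, and the path to inspect is an assumption: pass whatever spark.local.dir points to on your machine (Spark falls back to /tmp when it is not set).

```python
import os
import sys


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def largest_entries(local_dir, top=10):
    """Return (size_in_bytes, name) pairs for the top-level entries, largest first."""
    sizes = []
    for name in os.listdir(local_dir):
        full = os.path.join(local_dir, name)
        if os.path.isdir(full) and not os.path.islink(full):
            size = dir_size_bytes(full)
        else:
            try:
                size = os.lstat(full).st_size
            except OSError:
                size = 0
        sizes.append((size, name))
    sizes.sort(reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    # Hypothetical usage: python local_dir_usage.py /tmp
    # Pass the directory that spark.local.dir points to on the submitting machine.
    for path in sys.argv[1:]:
        print(path)
        for size, name in largest_entries(path):
            print(f"{size / 1e6:10.1f} MB  {name}")
```

If most of the space belongs to directories created around the time jobs were submitted, that would support the client-mode explanation. Pointing spark.local.dir at a larger disk on the submitting machine, or submitting in cluster mode if your deployment supports it, would then be the options to look at.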