Search code examples
apache-sparkhadoop-yarn

Spark - what triggers a spark job to be re-attempted?


For educational purposes mostly, I was trying to get Yarn + Spark to re-attempt my Spark job on purpose (i.e. fail it, and see it be rescheduled by yarn in another app-attempt).

Various failures seem to cause a Spark job to be re-run; I know I have seen this numerous times. I'm having trouble simulating it though.

I have tried forcefully stopping the streaming context and calling System.exit(-1) and neither achieved the desired affect.


Solution

  • After lots of playing with this, I have seen that Spark + YARN do not play well together with exit codes (at least not for MapR 5.2.1's versions), but I don't think it's MapR-specific.

    Sometimes a spark program will throw an exception and die, and it reports SUCCESS to YARN (or YARN gets SUCCESS somehow), so there are no reattempts.

    Doing System.exit(-1) does not provide any more stable results, sometimes it can be SUCCESS or FAILURE even when the same code is repeated.

    Interestingly, getting a reference to the main thread of the driver and killing it does seem to force a re-attempt; but that is very dirty and requires use of a deprecated function on the thread class.