Tags: apache-spark, spark-streaming, spark-structured-streaming

Does a stage kill from the Spark UI cause a reprocessing of the data?


In Spark UI there is this capability of killing an actively running stage:

[Screenshot: Spark UI "Active Stages" table with a kill link]

When a stage is killed with this button, will the tasks associated with that stage be reprocessed, or will they be skipped?

I noticed that when a stage is killed, the associated job is killed as well, which makes me think that no reprocessing happens. I'd like to know if there is an official doc or something that states this clearly (or if someone has played around with this feature and knows how it works).


Solution

  • In the Active Stages table, you can kill a running stage with the kill link, as you show.

    Normally the application is killed as well, so no reprocessing of the stage occurs. Nor can the killed stage simply be skipped for the current path of processing; that would not be logical, since downstream stages depend on its output.

    If the application has several independent paths of processing, however, the other paths continue to run.

    Note that a failure reason is shown only for failed stages, not for killed ones. You can still inspect the run in the History Server, of course.

    In short, this confirms your observations.
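The programmatic counterpart of the UI's kill link is cancelling work from the driver. A minimal PySpark sketch below (names like `"path-a"` and the sleep durations are illustrative, and it assumes a local Spark installation) shows one path of processing being cancelled via a job group while an independent job still completes; this mirrors the behaviour described above:

```python
# Sketch: cancelling one processing path while an independent one continues.
# Assumes pyspark is installed; group name "path-a" is purely illustrative.
import threading
import time
from pyspark import SparkContext

sc = SparkContext("local[2]", "cancel-demo")

def slow_path():
    # Tag all jobs submitted from this thread with a job group,
    # so they can be cancelled as a unit (like killing from the UI).
    sc.setJobGroup("path-a", "independent path A")
    try:
        # Deliberately slow tasks, so there is time to cancel the job.
        sc.parallelize(range(4), 4).map(lambda x: time.sleep(60) or x).collect()
        print("path A finished")
    except Exception as e:
        # The cancelled job surfaces as an exception in the caller.
        print("path A cancelled:", type(e).__name__)

t = threading.Thread(target=slow_path)
t.start()
time.sleep(5)                  # let the job start running
sc.cancelJobGroup("path-a")    # analogous to the UI's kill link
t.join()

# An independent path of processing is unaffected and runs to completion.
print(sc.parallelize(range(10)).sum())
sc.stop()
```

The cancelled work is not retried or skipped and resubmitted; it simply fails with an exception in the submitting thread, while unrelated jobs keep running. (Thread and job-group interaction in PySpark has version-dependent caveats, so treat this as a sketch rather than a guaranteed recipe.)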