apache-spark, rdd

Abort RDD map (all mappers) on condition


I have a huge file to process. It is loaded into an RDD, and I perform some validations on its lines using a map function. There is a set of errors that are fatal for the whole file if encountered on even one line. I would therefore like to abort all other processing (all mappers launched across the whole cluster) as soon as that validation fails on any line, to save some time.
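
For reference, the validation currently looks roughly like the sketch below (the path and the rule are just placeholders, and sc is my existing JavaSparkContext):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    JavaRDD<String> lines = sc.textFile("hdfs:///path/to/huge-file.txt");
    JavaRDD<Boolean> valid = lines.map(new Function<String, Boolean>() {
        @Override
        public Boolean call(String line) {
            // placeholder per-line validation; false = fatal error on this line
            return !line.contains("FATAL");
        }
    });
    // Any action over `valid` (count, collect, saveAsTextFile, ...) runs the
    // map over every line -- nothing here aborts early on a fatal error.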

Is there a way to achieve this?

Thank you.

PS: Using Spark 1.6, Java API


Solution

  • Well, after further searching and a better understanding of the laziness of Spark transformations, I just have to do something like:

    rdd.filter(checkFatal).take(1)
    

    Then, thanks to that laziness, the processing stops on its own as soon as one record matching that rule is found :)
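
For completeness, here is a minimal end-to-end sketch of this approach with the Spark 1.6 Java API (the file path, the "FATAL" rule, and the checkFatal predicate are assumptions for illustration):

    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    public class FatalLineCheck {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("fatal-line-check");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> lines = sc.textFile("hdfs:///path/to/huge-file.txt");

            // Hypothetical predicate: true when a line carries a fatal error.
            Function<String, Boolean> checkFatal = new Function<String, Boolean>() {
                @Override
                public Boolean call(String line) {
                    return line.contains("FATAL"); // assumed validation rule
                }
            };

            // take(1) evaluates partitions incrementally, so Spark stops
            // scheduling new tasks as soon as one matching record is found.
            List<String> fatal = lines.filter(checkFatal).take(1);

            if (!fatal.isEmpty()) {
                System.err.println("Fatal error, aborting: " + fatal.get(0));
                sc.stop();
                return;
            }

            // No fatal lines: safe to run the real map-based processing here.
            sc.stop();
        }
    }

Note that take(1) scans partitions incrementally (it tries the first partition, then progressively more), so it avoids reading the whole file when a fatal line appears early; tasks that are already running still finish their current partition rather than being killed mid-flight.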