Search code examples
apache-sparkpyspark

Spark first() taking a very long time


I was thinking that first() should not take a lot of time(relative to the input size), but it might be that I'm doing something wrong in how it is used. I wasnt to check whether a given value is present, say 'FAIL' in the result column. I want to know if at least one row has a 'FAIL' result column value.

I have this:

is_failed = bool(df.select(df["result"]).filter(df["result"].contains("FAIL")).first())

Is this an efficient way to go about doing that in Pyspark?


Solution

  • Using first is >5x slower than using isEmpty if you only want to check at least one row is present.

    Excusing the Scala, you can prove this out in pyspark as well:

        var start = System.currentTimeMillis()
        var isEmpty = sparkSession.range(10000000)
          .repartition(100)
          .first() == null
        println(isEmpty)
        var end = System.currentTimeMillis()
        // scalastyle:off
        println(s"first duration = ${end - start}")
    
        start = System.currentTimeMillis()
        isEmpty = sparkSession.range(10000000)
          .repartition(100)
          .isEmpty
        println(isEmpty)
        end = System.currentTimeMillis()
        // scalastyle:off
        println(s"isEmpty duration = ${end - start}")
    
        isEmpty = sparkSession.range(10000000)
          .repartition(100)
          .first() == null
        println(isEmpty)
        end = System.currentTimeMillis()
        // scalastyle:off
        println(s"first duration = ${end - start}")
    
        start = System.currentTimeMillis()
        isEmpty = sparkSession.range(10000000)
          .repartition(100)
          .isEmpty
        println(isEmpty)
        end = System.currentTimeMillis()
        // scalastyle:off
        println(s"isEmpty duration = ${end - start}")
    

    The duplication is a poor mans microbenchmark but indicative and quick enough to play around with. The performance difference will increase with the number of fields used by .first() as .isEmpty does not select any columns:

    false
    first duration = 8784
    false
    isEmpty duration = 1242
    false
    first duration = 5034
    false
    isEmpty duration = 1041