I assumed that first() should not take long (relative to the input size), but I may be using it incorrectly. I want to check whether a given value, say 'FAIL', is present in the result column, i.e. whether at least one row has 'FAIL' as its result value.
I have this:
is_failed = bool(df.select(df["result"]).filter(df["result"].contains("FAIL")).first())
Is this an efficient way to go about doing that in PySpark?
Using first() is more than 5x slower than using isEmpty when you only want to check whether at least one row is present.
Excusing the Scala, you can prove this out in PySpark as well:
var start = System.currentTimeMillis()
var isEmpty = sparkSession.range(10000000)
  .repartition(100)
  .first() == null
println(isEmpty)
var end = System.currentTimeMillis()
// scalastyle:off
println(s"first duration = ${end - start}")

start = System.currentTimeMillis()
isEmpty = sparkSession.range(10000000)
  .repartition(100)
  .isEmpty
println(isEmpty)
end = System.currentTimeMillis()
println(s"isEmpty duration = ${end - start}")

start = System.currentTimeMillis()
isEmpty = sparkSession.range(10000000)
  .repartition(100)
  .first() == null
println(isEmpty)
end = System.currentTimeMillis()
println(s"first duration = ${end - start}")

start = System.currentTimeMillis()
isEmpty = sparkSession.range(10000000)
  .repartition(100)
  .isEmpty
println(isEmpty)
end = System.currentTimeMillis()
println(s"isEmpty duration = ${end - start}")
The duplication is a poor man's microbenchmark, but it is indicative and quick enough to play around with. The performance difference will grow with the number of fields selected, because .first() must deserialize the row's columns while .isEmpty does not select any columns:
false
first duration = 8784
false
isEmpty duration = 1242
false
first duration = 5034
false
isEmpty duration = 1041