I assumed that first() should not take long (relative to the input size), but I may be using it incorrectly. I want to check whether a given value, say 'FAIL', is present in the result column, i.e. whether at least one row has 'FAIL' as its result value.
I have this:
is_failed = bool(df.select(df["result"]).filter(df["result"].contains("FAIL")).first())
Is this an efficient way to go about doing that in PySpark?
Using first() is more than 5x slower than using isEmpty when you only want to check whether at least one row is present.
Excusing the Scala, you can prove this out in PySpark as well:
var start = System.currentTimeMillis()
var isEmpty = sparkSession.range(10000000)
  .repartition(100)
  .first() == null
println(isEmpty)
var end = System.currentTimeMillis()
// scalastyle:off
println(s"first duration = ${end - start}")

start = System.currentTimeMillis()
isEmpty = sparkSession.range(10000000)
  .repartition(100)
  .isEmpty
println(isEmpty)
end = System.currentTimeMillis()
println(s"isEmpty duration = ${end - start}")

start = System.currentTimeMillis()
isEmpty = sparkSession.range(10000000)
  .repartition(100)
  .first() == null
println(isEmpty)
end = System.currentTimeMillis()
println(s"first duration = ${end - start}")

start = System.currentTimeMillis()
isEmpty = sparkSession.range(10000000)
  .repartition(100)
  .isEmpty
println(isEmpty)
end = System.currentTimeMillis()
println(s"isEmpty duration = ${end - start}")
The duplication is a poor man's microbenchmark, but it is indicative and quick enough to play around with. The performance difference will grow with the number of fields selected, because .first() must deserialize the row's columns while .isEmpty does not select any columns:
false
first duration = 8784
false
isEmpty duration = 1242
false
first duration = 5034
false
isEmpty duration = 1041