I want to access the first 100 rows of a Spark DataFrame and write the result back to a CSV file.
Why is `take(100)` basically instant, whereas

```scala
df.limit(100)
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")
```

takes forever? I do not want to obtain the first 100 records per partition, but just any 100 records.

Why is `take()` so much faster than `limit()`?
This is because limit pushdown is currently not supported in Spark; see this very good answer.
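One way to see what is going on is to compare the physical plans. This is only a sketch, reusing the `df` from the question; the exact operator names printed can vary between Spark versions:

```scala
// Sketch: compare the physical plans (operator names may differ slightly
// depending on your Spark version).

// With limit() as the final operator, Spark plans a CollectLimit, which
// tries one partition first and only scans more partitions if needed:
df.limit(100).explain()

// With another operator after the limit (here repartition), Spark plans
// LocalLimit/GlobalLimit plus an Exchange, i.e. a shuffle that touches
// every partition before anything is written, hence the slow write:
df.limit(100).repartition(1).explain()
```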
Actually, `take(n)` should take a really long time as well. I just tested it, however, and got the same results as you: `take` is almost instantaneous regardless of the data size, while `limit` takes a lot of time.
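Given that, one possible workaround is to collect the rows with the fast `take()` and turn them back into a small DataFrame before writing. This is only a sketch under the question's setup (`spark`, `df`, `myPath`, and the write options come from the question) and assumes the 100 rows fit easily in driver memory:

```scala
import org.apache.spark.sql.SaveMode

// Collect any 100 rows on the driver (fast, as observed above).
val firstRows = df.take(100)

// Rebuild a small DataFrame from the collected rows, reusing df's schema.
val small = spark.createDataFrame(
  spark.sparkContext.parallelize(firstRows),
  df.schema
)

small
  .coalesce(1) // a single output file
  .write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")
```

Since the rows already live on the driver at that point, this sidesteps the shuffle that the GlobalLimit plan introduces.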