
Spark: access first n rows - take vs limit


I want to access the first 100 rows of a spark data frame and write the result back to a CSV file.

Why is take(100) basically instant, whereas

import org.apache.spark.sql.SaveMode

df.limit(100)
  .repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .option("header", true)
  .option("delimiter", ";")
  .csv("myPath")

takes forever? I do not want to obtain the first 100 records per partition, just any 100 records.

Why is take() so much faster than limit()?


Solution

  • This is because predicate pushdown is currently not supported in Spark; see this very good answer.

    Actually, take(n) should take a really long time as well. I just tested it, however, and got the same results as you did: take is almost instantaneous regardless of dataset size, while limit takes a lot of time. Comparing the physical plans, as sketched below, shows where the difference comes from.
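
    A minimal way to see the difference is to print the physical plans. This is just a sketch, assuming a local SparkSession named spark and a hypothetical input path myInputPath standing in for your actual source:

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder()
        .appName("take-vs-limit")
        .master("local[*]")
        .getOrCreate()

      val df = spark.read.parquet("myInputPath")

      // take(100) is executed via CollectLimit on the driver: Spark
      // scans one partition first and only reads further partitions
      // if it still needs rows, so it can stop early.
      df.take(100)

      // limit(100) followed by a wide operation like repartition(1)
      // plans a LocalLimit per partition, a shuffle to a single
      // partition, and a GlobalLimit, so a full distributed job
      // runs before anything is written.
      df.limit(100).repartition(1).explain()

    The second plan contains an Exchange feeding a GlobalLimit, and that shuffle is what you end up waiting for; a plain df.limit(100).explain() instead ends in CollectLimit, which is the fast path take uses.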
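
    If the goal is simply to get any 100 rows into one CSV file, a workaround is to let take(100) do the fast driver-side collect and then write the small result back out. A sketch under the same assumptions (spark and df as above; 100 rows are trivially small, so holding them on the driver is safe):

      import org.apache.spark.sql.{Row, SaveMode}

      // Fast path: incremental partition scan, stops at ~100 rows.
      val first100: Array[Row] = df.take(100)

      // Rebuild a tiny single-partition DataFrame from the collected
      // rows, reusing the original schema, and write it as CSV.
      val smallDf = spark.createDataFrame(
        spark.sparkContext.parallelize(first100.toSeq, numSlices = 1),
        df.schema
      )

      smallDf.write
        .mode(SaveMode.Overwrite)
        .option("header", true)
        .option("delimiter", ";")
        .csv("myPath")

    Because smallDf already has a single partition, no repartition(1) is needed before the write.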