scala, apache-spark, parquet

Spark: save (write) Parquet as only one file


If I write

dataFrame.write.format("parquet").mode("append").save("temp.parquet")

then the temp.parquet folder contains as many files as the DataFrame has rows.

I don't think I fully understand Parquet yet. Is this expected behavior?


Solution

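  • Spark writes a separate part file for each partition of the DataFrame, so reduce it to a single partition with coalesce before the write operation: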

    dataFrame.coalesce(1).write.format("parquet").mode("append").save("temp.parquet")


    EDIT-1

    On closer inspection, the docs do warn about coalesce:

    However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1)

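    Therefore, as suggested by @Amar, it's better to use repartition(1): it adds a shuffle step, so the upstream computation still runs in parallel and only the final write happens in a single partition.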
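
    A minimal sketch of the repartition variant, assuming a local SparkSession and a small hypothetical sample DataFrame standing in for the question's dataFrame (the app name, master URL, and column names are illustrative and not from the original post):

    ```scala
    import org.apache.spark.sql.{SaveMode, SparkSession}

    object SingleParquetFile {
      def main(args: Array[String]): Unit = {
        // Assumption: a local session purely for illustration.
        val spark = SparkSession.builder()
          .appName("single-parquet-file")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical sample data standing in for the question's dataFrame.
        val dataFrame = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

        // The number of part files normally follows the number of partitions.
        println(s"partitions before: ${dataFrame.rdd.getNumPartitions}")

        dataFrame
          .repartition(1)            // shuffle everything into one partition
          .write
          .format("parquet")
          .mode(SaveMode.Append)     // same as .mode("append")
          .save("temp.parquet")

        spark.stop()
      }
    }
    ```

    Because the save mode is append, each run should add one more part file to the temp.parquet folder rather than replacing the existing ones; switching to overwrite mode is one way to keep a single file across runs.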