Tags: apache-spark, dataframe, hdfs, parquet

Does coalesce(1) on a DataFrame before writing have any impact on performance?


Before I write a DataFrame to HDFS, I call coalesce(1) so that it writes only one file. That makes the output easier to handle manually, for example when copying it around or fetching it from HDFS.

I write the output like this:

outputData.coalesce(1).write.parquet(outputPath)

(outputData is org.apache.spark.sql.DataFrame)

I would like to ask whether this has any impact on performance compared with not coalescing:

outputData.write.parquet(outputPath)

Solution

  • Yes, it will write with 1 worker.

    So even if you give the job 10 CPU cores, the write runs as a single task on one worker, because the data has a single partition.

    This is a problem if your output is very large (10 GB or more), but it is fine, and even recommended, if the output is small (around 100 MB).
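
To see the effect for yourself, here is a minimal sketch that compares partition counts before and after coalesce(1). It assumes a local Spark session; the sample data, the object name, and the /tmp/coalesce-sketch path are made up for illustration and are not from the question. It also shows repartition(1), which the Spark API docs suggest as an alternative when a drastic coalesce would pull the upstream computation onto too few nodes: it adds a shuffle, so the earlier stages can stay parallel and only the final write is a single task.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CoalesceWriteSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session with 4 cores, just for experimenting.
    val spark = SparkSession.builder()
      .appName("coalesce-write-sketch")
      .master("local[4]")
      .getOrCreate()

    // Hypothetical stand-in for outputData from the question.
    val outputData: DataFrame =
      spark.range(0L, 1000000L).toDF("id").repartition(8)

    // Hypothetical output location; replace with your own path.
    val outputPath = "/tmp/coalesce-sketch"

    // Number of partitions = number of write tasks (and output files).
    println(s"original:    ${outputData.rdd.getNumPartitions} partitions")
    println(s"coalesce(1): ${outputData.coalesce(1).rdd.getNumPartitions} partition(s)")

    // Parallel write: one part file per partition.
    outputData.write.mode("overwrite").parquet(s"$outputPath/parallel")

    // Single-task write, no shuffle: one part file.
    outputData.coalesce(1).write.mode("overwrite").parquet(s"$outputPath/coalesced")

    // Single-task write after a shuffle: upstream stages keep their
    // parallelism, only the final write runs as one task.
    outputData.repartition(1).write.mode("overwrite").parquet(s"$outputPath/repartitioned")

    spark.stop()
  }
}
```

Whether the extra shuffle from repartition(1) pays off depends on how expensive the computation before the write is, so treat this as something to measure on your own data rather than a rule.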