Before I write a DataFrame to HDFS, I call coalesce(1)
so that it writes only one file. That makes it easier to handle manually when copying files around, fetching them from HDFS, and so on.
I write the output like this:
outputData.coalesce(1).write.parquet(outputPath)
(outputData is an org.apache.spark.sql.DataFrame)
I would like to ask whether this has any impact on performance compared with not coalescing:
outputData.write.parquet(outputPath)
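Yes. With coalesce(1) the DataFrame has a single partition, so the write is done by one worker.
So even if you give the job 10 CPU cores, it will still write with 1 worker (single partition).
That is a problem if the output is very big (10 GB or more), but it is fine, and even recommended, if the output is small (around 100 MB).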
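To see the effect yourself, here is a minimal sketch, assuming a local SparkSession and made-up sample data and output paths (the object name, the local[4] master and the /tmp/coalesce_sketch paths are all just for illustration, not from the question). It prints the partition count for each variant, since each partition becomes one write task and one part file. The repartition(1) variant is included because the Spark docs for coalesce note that a drastic coalesce such as coalesce(1) can also pull the upstream computation onto fewer nodes, and suggest repartition to avoid that at the cost of a shuffle.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CoalesceWriteSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 4 cores, only for trying this out.
    val spark = SparkSession.builder()
      .appName("coalesce-write-sketch")
      .master("local[4]")
      .getOrCreate()

    // Assumption: stand-in data; in the question this is the real outputData.
    val outputData: DataFrame = spark.range(0L, 1000000L).toDF("id").repartition(8)

    // Hypothetical output location.
    val base = "/tmp/coalesce_sketch"

    // No coalesce: one write task (and one part file) per partition.
    println(s"partitions without coalesce: ${outputData.rdd.getNumPartitions}")
    outputData.write.mode("overwrite").parquet(s"$base/many_files")

    // coalesce(1): a single partition, so a single task writes a single file.
    val single = outputData.coalesce(1)
    println(s"partitions after coalesce(1): ${single.rdd.getNumPartitions}")
    single.write.mode("overwrite").parquet(s"$base/one_file_coalesce")

    // repartition(1): also a single file, but with a shuffle in between,
    // so the stages before the shuffle can still run in parallel.
    val shuffled = outputData.repartition(1)
    println(s"partitions after repartition(1): ${shuffled.rdd.getNumPartitions}")
    shuffled.write.mode("overwrite").parquet(s"$base/one_file_repartition")

    spark.stop()
  }
}
```

For small outputs the difference will probably not be noticeable. For large ones, compare the write stage of each variant in the Spark UI before deciding.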