I am having couple of Spark jobs which processes thousands of files every day. File size may very from MBs to GBs. After finishing job I usually save using the following code
finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
dataFrame.write.format("orc").save("/path/in/hdfs") //storing as ORC file as of Spark 1.4
Spark job creates plenty of small part files in final output directory. As far as I understand Spark creates part file for each partition/task - is this correct? How do we control amount of part files Spark creates?
Finally, I would like to create Hive table using these parquet/orc directory and I heard Hive is slow when we have large no of small files.
You may want to try using the DataFrame.coalesce method to decrease the number of partitions; it returns a DataFrame with the specified number of partitions (each of which becomes a file on insertion).
To increase or decrease the partitions you can use Dataframe.repartition
function.
But coalesce
does not cause shuffle while repartition
does.