I have an external ORC table with a large number of small files arriving from the source on a daily basis. I need to merge these files into larger ones.
I tried loading the ORC files into Spark and saving them back with the overwrite mode:
val fileName = "/user/db/table_data/" // This table is partitioned on the date column and contains many small data files.
val df = hiveContext.read.format("orc").load(fileName)
df.repartition(1).write.mode(SaveMode.Overwrite).partitionBy("date").orc("/user/db/table_data/")
But mode(SaveMode.Overwrite) deletes all the existing data from HDFS. When I tried without mode(SaveMode.Overwrite), it threw an error that the file already exists.
Can anyone help me proceed?
As suggested by @Avseiytsev, I stored the merged ORC files in a different folder in HDFS and moved the data to the table path after the job completed.
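A minimal sketch of that workaround, assuming the same `hiveContext` as above and a hypothetical staging folder name (`/user/db/table_data_merged/` is my own choice, not from the original setup):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val sourcePath  = "/user/db/table_data/"        // existing table location
val stagingPath = "/user/db/table_data_merged/" // hypothetical staging folder

// Read the small ORC files from the table location.
val df = hiveContext.read.format("orc").load(sourcePath)

// Write the merged files to the staging folder, NOT the source path,
// so SaveMode.Overwrite cannot delete the data while it is being read.
df.repartition(1)
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("date")
  .orc(stagingPath)

// After the job completes, swap the merged output into the table path.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path(sourcePath), true)
fs.rename(new Path(stagingPath), new Path(sourcePath))
```

Note that `repartition(1)` yields a single output file per date partition; for large partitions, a higher partition count (or `coalesce`) may be a better fit.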