apache-spark · hadoop · hadoop2 · orc

Unable to Merge Small ORC Files using Spark


I have an external ORC table with a large number of small files, which arrive from the source on a daily basis. I need to merge these files into larger files.

I tried loading the ORC files into Spark and saving them back with the overwrite mode:

val fileName = "/user/db/table_data/"  // Table is partitioned on the date column and contains many small data files.
val df = hiveContext.read.format("orc").load(fileName)
df.repartition(1).write.mode(SaveMode.Overwrite).partitionBy("date").orc("/user/db/table_data/")

But `mode(SaveMode.Overwrite)` deletes all the data from HDFS: because the output path is the same as the input path, Spark removes the source files before the (lazily evaluated) read has actually happened. When I tried it without `mode(SaveMode.Overwrite)`, it threw an error saying the file already exists.

Can anyone help me to proceed?


Solution

  • As suggested by @Avseiytsev, I stored the merged ORC files in a different HDFS folder from the source, and moved the data to the table path after the job completed.
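The approach above can be sketched as follows. This is a minimal illustration, assuming the question's environment (`hiveContext`, `sc`) and a hypothetical staging directory `/user/db/table_data_merged/`; adjust paths and the swap step to your own cluster and safety requirements.

```scala
import org.apache.spark.sql.SaveMode
import org.apache.hadoop.fs.{FileSystem, Path}

val tablePath   = "/user/db/table_data/"
val stagingPath = "/user/db/table_data_merged/" // hypothetical staging location, not the source path

// Read the small ORC files and write the merged output to the staging
// directory, so the source files are never deleted while Spark is still
// lazily reading them.
val df = hiveContext.read.format("orc").load(tablePath)
df.repartition(1)
  .write.mode(SaveMode.Overwrite)
  .partitionBy("date")
  .orc(stagingPath)

// Only after the write has fully completed, swap the directories on HDFS.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path(tablePath), true)                    // drop the old small files
fs.rename(new Path(stagingPath), new Path(tablePath))   // move merged files into place
```

With `partitionBy("date")`, `repartition(1)` yields roughly one output file per date partition rather than one file overall, which is usually what is wanted when compacting a partitioned table.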