apache-spark, apache-spark-sql, orc, spark-checkpoint

How to handle a failure scenario in a Spark write to an ORC file


I have a use case where I push data from MongoDB to HDFS as an ORC file. The job runs once a day and appends the data to the ORC file that already exists on HDFS.

My concern is what happens if the job fails or is stopped partway through writing to the ORC file. How should I handle that scenario, considering that some data may already have been written to the ORC file? I want to avoid duplicates in the ORC file.

Snippet for writing in ORC format:

    import com.mongodb.spark.config.ReadConfig
    import com.mongodb.spark.sql._          // enables .mongo(...) on the DataFrameReader
    import org.apache.spark.sql.SaveMode
    import sparkSession.implicits._         // enables the $"column" syntax

    // Read the day's documents from MongoDB, filtered by insert time
    val df = sparkSession
      .read
      .mongo(ReadConfig(Map("database" -> "dbname", "collection" -> "tableName")))
      .filter($"insertdatetime" >= fromDateTime && $"insertdatetime" <= toDateTime)

    // Append the result to the existing ORC output on HDFS
    df.write
      .mode(SaveMode.Append)
      .format("orc")
      .save("/path_to_orc_file_on_hdfs")

I don't want to checkpoint the complete RDD, as that would be a very expensive operation. I also don't want to create multiple ORC files; the requirement is to maintain a single file only.

Is there any other solution or approach I should try?


Solution

  • Hi, one of the best approaches is to write your data as one folder per day under HDFS.

    That way, if your ORC-writing job fails, you can simply clean up that day's folder.

    The cleanup should happen on the bash side of your job: if the return code is != 0, delete the day's ORC folder and then retry (see the sketch below).

    Edit: partitioning by write date will also make your later ORC reads with Spark more efficient.
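
    A minimal sketch of the per-day folder idea in Scala, assuming the DataFrame has already been read from MongoDB as in the question's snippet. The base path and the dt= folder naming are illustrative, and using SaveMode.Overwrite so that a retry simply replaces a partial folder is a variation on the bash cleanup described above:

        import org.apache.spark.sql.{DataFrame, SaveMode}

        // Write one day's data into its own dated folder, e.g. <basePath>/dt=2019-05-01.
        // SaveMode.Overwrite only touches that folder, so re-running the job after a
        // failure replaces any partial output instead of appending duplicates.
        def writeDailyOrc(df: DataFrame, basePath: String, runDate: java.time.LocalDate): Unit = {
          val dailyPath = s"$basePath/dt=$runDate"

          df.coalesce(1)                 // keep a single ORC file per day, as the question requires
            .write
            .mode(SaveMode.Overwrite)
            .format("orc")
            .save(dailyPath)
        }

    A wrapper script can still check the Spark job's exit code and, on failure, remove that day's folder (for example with hdfs dfs -rm -r) before retrying. Because the folders follow the key=value naming (dt=...), reading the base path back with spark.read.orc(basePath) exposes dt as a partition column, so later queries can prune by date.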