scala, apache-spark, orc

Ignoring corrupted ORC files when reading via Spark


I have multiple ORC files in HDFS with the following directory structure:

orc/
├─ data1/
│  ├─ 00.orc
│  ├─ 11.orc
├─ data2/
│  ├─ 22.orc
│  ├─ 33.orc

I am reading these files using Spark:

spark.sqlContext.read.format("orc").load("/orc/data*/")

The problem is that one of the files is corrupted, so I want to skip/ignore that file.

The only way I see is to get all the ORC files and validate them (by reading each one) before passing them to Spark, something like the sketch below. But that way I would be reading the same files twice.
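
A rough sketch of that pre-validation pass (assuming a spark-shell style session named spark; the glob simply mirrors the layout above):

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// First pass: list every ORC file and keep only those Spark can actually read
val orcFiles = fs.globStatus(new Path("/orc/data*/*.orc")).map(_.getPath.toString)
val validFiles = orcFiles.filter { f =>
  Try(spark.read.format("orc").load(f).limit(1).count()).isSuccess
}

// Second pass: read only the files that passed validation
val df = spark.read.format("orc").load(validFiles: _*)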

Is there any way I can avoid reading the files twice? Does Spark provide anything regarding this?


Solution

  • Set spark.sql.files.ignoreCorruptFiles to true:

    spark.sql("set spark.sql.files.ignoreCorruptFiles=true")