Tags: bash, apache-spark, hdfs, spark-streaming, orc

What should be the minimum size of a valid ORC file with snappy compression?


The scenario I am dealing with: every hour, a Spark Streaming application generates about 10k small ORC files in HDFS, and at the end of the hour a Spark merge job combines those small files into bigger chunks and writes them to the Hive landing path for the external table to pick up. Sometimes a corrupt ORC file makes the merge job fail. The task is to find the corrupt ORC files, move them to a bad-records path, and then let the Spark merge job begin. After going through the ORC file format, it seems a valid ORC file ends with the magic string "ORC" followed by one more byte (the postscript length). How do I check that in an optimised way, so that validating those 10k ORC files doesn't take much time? I thought of writing a bash shell script, but it seems to take a good amount of time to validate ORC files in HDFS. My idea is to narrow down the validation if I know the minimum size of a valid ORC file, because most of our corrupt files are very small (mostly 3 bytes). Any suggestion would be very helpful; a rough sketch of the tail check I have in mind is below.
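
A minimal version of that tail check in Scala, using the Hadoop FileSystem API (the object and method names here are just mine, and the minimum-size threshold is a placeholder, since that number is exactly what I am asking about):

    import org.apache.hadoop.fs.{FileSystem, Path}

    object OrcTailCheck {
      // Cheap sanity check: a readable ORC file ends with the magic bytes "ORC"
      // followed by a one-byte postscript length, so it must be at least 4 bytes
      // long and carry that magic at the very end of the file.
      def looksLikeValidOrc(fs: FileSystem, file: Path, minSizeBytes: Long = 4L): Boolean = {
        val len = fs.getFileStatus(file).getLen
        if (len < math.max(minSizeBytes, 4L)) {
          false
        } else {
          val in = fs.open(file)
          try {
            val tail = new Array[Byte](4)
            in.readFully(len - 4, tail)   // positioned read of the last 4 bytes only
            tail(0) == 'O'.toByte && tail(1) == 'R'.toByte && tail(2) == 'C'.toByte
          } finally {
            in.close()
          }
        }
      }
    }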

PS: I can't use set spark.sql.files.ignoreCorruptFiles=true because I have to track the files and move them to a bad-records path.


Solution

  • Found a solution. We can use set spark.sql.files.ignoreCorruptFiles=true and then track the ignored files with the method below:

        import org.apache.hadoop.fs.Path
        import org.apache.spark.sql.DataFrame
        import org.apache.spark.sql.functions.input_file_name

        // df is the DataFrame read with ignoreCorruptFiles enabled;
        // listOfCompleteFiles is the full list of input files (e.g. from an HDFS listing).
        def trackIgnoreCorruptFiles(df: DataFrame, listOfCompleteFiles: List[Path]): List[Path] = {

          // Files Spark actually managed to read.
          val listOfFileAfterIgnore = df.withColumn("file_name", input_file_name())
            .select("file_name")
            .distinct()
            .collect()
            .map(x => new Path(x(0).toString))
            .toList

          // Whatever is missing from the read result was ignored as corrupt.
          listOfCompleteFiles.diff(listOfFileAfterIgnore)
        }

    input_file_name() is a built-in Spark SQL function that returns the full path of the file each row was read from, and we add it as a column on the DataFrame df. Collecting the distinct values of that column gives the list of files that Spark actually read, i.e. the files that remain after the corrupt ones were ignored. The diff against listOfCompleteFiles (the complete list of input files) then gives exactly the files Spark ignored. You can then easily move those files to a badRecordsPath for future analysis.
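
    End to end, the flow could look like the sketch below. The input directory and badRecordsPath are placeholders, listOfCompleteFiles comes from a plain HDFS listing, and trackIgnoreCorruptFiles is the method defined above:

        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().getOrCreate()
        // Let Spark skip corrupt files while reading; we reconstruct which ones afterwards.
        spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

        val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
        val inputDir = new Path("/data/streaming/orc/current-hour")   // placeholder path
        val badRecordsPath = new Path("/data/badrecords/orc")         // placeholder path

        // Complete list of files that were supposed to be merged.
        val listOfCompleteFiles: List[Path] =
          fs.listStatus(inputDir).filter(_.isFile).map(_.getPath).toList

        // Corrupt files are silently dropped here because ignoreCorruptFiles is true.
        val df = spark.read.orc(inputDir.toString)

        // Whatever is in the complete list but missing from the read result was ignored.
        val ignoredFiles = trackIgnoreCorruptFiles(df, listOfCompleteFiles)

        // Park the corrupt files for later analysis, then run the merge job as usual.
        fs.mkdirs(badRecordsPath)   // no-op if it already exists
        ignoredFiles.foreach(p => fs.rename(p, new Path(badRecordsPath, p.getName)))

    Both input_file_name() and the HDFS listing normally return fully qualified paths, so the Path equality used by diff should line up; if your environment mixes qualified and unqualified paths, normalize both lists before taking the diff.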