Tags: hadoop, error-handling, hive, corrupt

Handle corrupted gzip files in Hadoop / Hive



I have daily folders on HDFS with a lot of tar.gz files, each containing a large number of text files.
Some of those tar.gz files turned out to be corrupted and cause Hive/MapReduce jobs to crash with an "unexpected end of stream" error when they are processed.

I identified a few of them and tested them with tar -zxvf. The extraction does exit with an error, but it still produces a decent number of files before that happens.

Is there a way to keep Hive/MapReduce jobs from simply crashing when a tar.gz file is corrupted? I've tested some error-skipping and failure-tolerance parameters such as
mapred.skip.attempts.to.start.skipping,
mapred.skip.map.max.skip.records,
mapred.skip.mode.enabled,
mapred.map.max.attempts,
mapred.max.map.failures.percent,
mapreduce.map.failures.maxpercent.

In a small number of cases this helped get a complete folder processed without crashing, but mostly it caused the job to hang and never finish at all.
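
For reference, here is a minimal sketch (my own illustration, not tested against this setup) of how those legacy mapred.* properties could be set programmatically on an old-API JobConf when submitting a plain MapReduce job; in a Hive session you would SET the same keys instead. The values are arbitrary examples, and the exact property names and behaviour vary between Hadoop versions.

    import org.apache.hadoop.mapred.JobConf;

    // Illustrative only: these are the property keys listed above; their effect
    // (record skipping, tolerated map failures) depends on the Hadoop version.
    public class SkipSettingsExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setBoolean("mapred.skip.mode.enabled", true);
            conf.setInt("mapred.skip.attempts.to.start.skipping", 2);
            conf.setLong("mapred.skip.map.max.skip.records", 1L);
            conf.setInt("mapred.map.max.attempts", 4);
            conf.setInt("mapred.max.map.failures.percent", 5);
            // ... configure input/output paths and the mapper,
            // then submit, e.g. with JobClient.runJob(conf)
        }
    }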

Unzipping every single file outside Hadoop just to recompress it afterwards (to get clean gzip files) and then re-upload everything to HDFS would be a painful process, both because of the extra steps and because of the large volume of data this would generate.

Has anyone found a cleaner / more elegant solution?

Thanks for any help.


Solution

  • I'm super late to the party here, but I just faced this exact issue with corrupt gzip files. I ended up solving it by writing my own RecordReader which would catch IOExceptions, log the name of the file that had a problem, and then gracefully discard that file and move on to the next one.

    I've written up some details (including code for the custom RecordReader) here: http://daynebatten.com/2016/03/dealing-with-corrupt-or-blank-files-in-hadoop/
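
    For readers who just want the general shape of the approach, here is a minimal sketch (my own illustration, not the code from the linked post) that wraps Hadoop's new-API LineRecordReader and treats an IOException from a corrupt gzip stream as the end of that file instead of a fatal error:

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.InputSplit;
        import org.apache.hadoop.mapreduce.RecordReader;
        import org.apache.hadoop.mapreduce.TaskAttemptContext;
        import org.apache.hadoop.mapreduce.lib.input.FileSplit;
        import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

        public class SafeLineRecordReader extends RecordReader<LongWritable, Text> {
            private final LineRecordReader delegate = new LineRecordReader();
            private String fileName;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException {
                fileName = ((FileSplit) split).getPath().toString();
                delegate.initialize(split, context);
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                try {
                    return delegate.nextKeyValue();
                } catch (IOException e) {
                    // Corrupt/truncated gzip: log it and stop reading this file
                    // instead of letting the exception kill the whole job.
                    // (Use a proper logger in real code.)
                    System.err.println("Skipping corrupt file " + fileName + ": " + e);
                    return false;
                }
            }

            @Override
            public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

            @Override
            public Text getCurrentValue() { return delegate.getCurrentValue(); }

            @Override
            public float getProgress() throws IOException { return delegate.getProgress(); }

            @Override
            public void close() throws IOException { delegate.close(); }
        }

    To wire this in you would also need a small TextInputFormat subclass whose createRecordReader() returns this reader. Note that Hive expects input formats written against the older org.apache.hadoop.mapred API, so for Hive tables the version in the linked post is the better starting point.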