Tags: hadoop, hive, apache-pig

How to avoid bad records in Hadoop PIG and Hive?


Hi, I am new to Hadoop. I found that bad records for any input format can be skipped in Java MapReduce using the SkipBadRecords class, so I just want to know: how is this possible in both Pig and Hive?


Solution

  • Bad Record Handling in Hive

    To skip bad records in Hive, you can enable skip mode for the query. The Hive configuration for skip mode is:

    SET mapred.skip.mode.enabled = true;
    

    You need to run the above command before your Hive query. You can also bound the skipping behavior with the following parameters:

    -- max attempts per map/reduce task before the job is failed
    SET mapred.map.max.attempts = 100;
    SET mapred.reduce.max.attempts = 100;
    -- acceptable number of records to skip around each bad record
    SET mapred.skip.map.max.skip.records = 30000;
    -- failed attempts before skip mode kicks in
    SET mapred.skip.attempts.to.start.skipping = 1;
    

    More detail about these skip-mode properties is available in the Hadoop MapReduce documentation.
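
    Putting it together, a session might look like the following sketch (the table name web_logs and the query are hypothetical; the SET properties are the standard Hadoop skip-mode settings shown above):

    SET mapred.skip.mode.enabled = true;
    SET mapred.skip.map.max.skip.records = 30000;
    SET mapred.skip.attempts.to.start.skipping = 1;
    -- with skip mode on, this query tolerates a few unreadable records
    -- instead of failing the whole job
    SELECT COUNT(*) FROM web_logs;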

  • Bad Record Handling in Pig

    Pig itself is designed to handle bad records. When processing gigabytes or terabytes of data, the odds are overwhelming that at least one row is corrupt or will cause an unexpected result, for example a division by zero even though no record was supposed to have a zero in the denominator. Failing an entire job over one bad record is not good. To avoid these failures, Pig inserts a null, issues a warning, and continues processing, so the job still finishes. Warnings are aggregated and reported as a count at the end; you should check them to be sure that the failure of a few records is acceptable for your job. If you need more detail about the warnings, you can turn off the aggregation by passing -w on the command line.
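
    For instance, a minimal sketch in Pig Latin (the file sales.txt and its schema are hypothetical): when units is zero, Pig substitutes a null, issues a divide-by-zero warning, and keeps going, so you can filter the nulls out afterwards.

    -- sales.txt and its schema are hypothetical
    records = LOAD 'sales.txt' AS (revenue:double, units:int);
    -- rows where units is 0 get a null unit_price plus a warning; the job keeps running
    priced = FOREACH records GENERATE revenue / units AS unit_price;
    -- drop the rows Pig nulled out
    clean = FILTER priced BY unit_price IS NOT NULL;
    DUMP clean;

    To see each warning individually rather than an aggregated count, run the script as pig -w script.pig (script.pig being your script file).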
