hadoop · mapreduce · apache-pig

Map Reduce Completed but pig Job Failed


I recently came across a scenario where a MapReduce job appears successful in the RM, whereas the Pig script returned exit code 8, which refers to "Throwable thrown (an unexpected exception)".

Added the script as requested:

REGISTER '$LIB_LOCATION/*.jar'; 

-- set the number of reducers (200 in our runs)
SET default_parallel $REDUCERS;
SET mapreduce.map.memory.mb 3072;
SET mapreduce.reduce.memory.mb 6144;

SET mapreduce.map.java.opts -Xmx2560m;
SET mapreduce.reduce.java.opts -Xmx5120m;
SET mapreduce.job.queuename dt_pat_merchant;

SET yarn.app.mapreduce.am.command-opts -Xmx5120m;
SET yarn.app.mapreduce.am.resource.mb 6144;

-- load data from EAP data catalog using given ($ENV = PROD)
data = LOAD 'eap-$ENV://event'
-- using a custom function
USING com.XXXXXX.pig.DataDumpLoadFunc
('{"startDate": "$START_DATE", "endDate" : "$END_DATE", "timeType" : "$TIME_TYPE", "fileStreamType":"$FILESTREAM_TYPE", "attributes": { "all": "true" } }', '$MAPPING_XML_FILE_PATH');

-- filter out null context entity records
filtered = FILTER data BY (attributes#'context_id' IS NOT NULL);

-- group data by session id
session_groups = GROUP filtered BY attributes#'context_id';

-- flatten events
flattened_events = FOREACH session_groups GENERATE FLATTEN(filtered);

-- remove the output directory if exists
RMF $OUTPUT_PATH;

-- store results in specified output location
STORE flattened_events INTO '$OUTPUT_PATH' USING com.XXXX.data.catalog.pig.EventStoreFunc();

And I can see "ERROR 2998: Unhandled internal error. GC overhead limit exceeded" in the Pig logs (stack trace below).

Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. GC overhead limit exceeded

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.hadoop.mapreduce.FileSystemCounter.values(FileSystemCounter.java:23)
        at org.apache.hadoop.mapreduce.counters.FileSystemCounterGroup.findCounter(FileSystemCounterGroup.java:219)
        at org.apache.hadoop.mapreduce.counters.FileSystemCounterGroup.findCounter(FileSystemCounterGroup.java:199)
        at org.apache.hadoop.mapreduce.counters.FileSystemCounterGroup.findCounter(FileSystemCounterGroup.java:210)
        at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154)
        at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:241)
        at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:370)
        at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:391)
        at org.apache.hadoop.mapred.ClientServiceDelegate.getTaskReports(ClientServiceDelegate.java:451)
        at org.apache.hadoop.mapred.YARNRunner.getTaskReports(YARNRunner.java:594)
        at org.apache.hadoop.mapreduce.Job$3.run(Job.java:545)
        at org.apache.hadoop.mapreduce.Job$3.run(Job.java:543)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at org.apache.hadoop.mapreduce.Job.getTaskReports(Job.java:543)
        at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.getTaskReports(HadoopShims.java:235)
        at org.apache.pig.tools.pigstats.mapreduce.MRJobStats.addMapReduceStatistics(MRJobStats.java:352)
        at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.addSuccessJobStats(MRPigStatsUtil.java:233)
        at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.accumulateStats(MRPigStatsUtil.java:165)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:360)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:282)
        at org.apache.pig.PigServer.launchPlan(PigServer.java:1431)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1416)
        at org.apache.pig.PigServer.execute(PigServer.java:1405)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:456)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:439)
        at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
        at org.apache.pig.Main.run(Main.java:624)

The configuration in the Pig script, with the parameters substituted, looks like this:

SET default_parallel 200;
SET mapreduce.map.memory.mb 3072;
SET mapreduce.reduce.memory.mb 6144;

SET mapreduce.map.java.opts -Xmx2560m;
SET mapreduce.reduce.java.opts -Xmx5120m;
SET mapreduce.job.queuename dt_pat_merchant;

SET yarn.app.mapreduce.am.command-opts -Xmx5120m;
SET yarn.app.mapreduce.am.resource.mb 6144;

The status of the job in the RM of the cluster says the job succeeded [can't post the image as my reputation is too low ;)].

This issue occurs frequently, and we have to rerun the job until it succeeds.

Please let me know a fix for this.

PS: The cluster this job runs on is one of the biggest in the world, so resources and storage space are not a concern, I'd say.

Thanks


Solution

  • Can you add your pig script here?

    I think you get this error because Pig itself (not the mappers and reducers) can't handle the output. If you use a DUMP operation in your script, try limiting the displayed dataset first. Let's assume you have an alias X for your data. Try:

    temp = LIMIT X 1;
    DUMP temp;
    

    Thus, you will see only one record and save some resources. You can do a STORE operation as well (see the Pig manual for how to do it).
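
    For example, the limited alias could be written out instead of dumped, which keeps the sample out of the client JVM entirely (X is a stand-in alias and the output path is illustrative):

    temp = LIMIT X 1;
    -- write the single sample record to HDFS instead of printing it in the client
    STORE temp INTO '/tmp/debug_sample' USING PigStorage(',');
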

    Obviously, you can configure Pig's heap size to be bigger, but note that Pig's heap is separate from mapreduce.map.memory.mb and mapreduce.reduce.memory.mb: those only size the task containers, while the OOM in your stack trace happens in the Pig client process as it gathers job statistics. Use the PIG_HEAPSIZE environment variable to enlarge the client heap.
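
    A minimal sketch of that, assuming the `pig` launcher is on the PATH (the 4096 MB value is illustrative, not a recommendation from the post):

    ```shell
    # PIG_HEAPSIZE (in MB) sizes the local Pig client JVM, not the MR tasks;
    # the launcher script reads it when building the java command line.
    export PIG_HEAPSIZE=4096
    echo "Pig client heap: ${PIG_HEAPSIZE} MB"
    # then launch as usual, e.g.: pig -f your_script.pig
    ```

    Setting it once in the wrapper that submits the job is usually enough; the task-side mapreduce.* settings in the script can stay as they are.
    
    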