I have a Pig script running on an EMR cluster (emr-5.4.0) using a custom UDF. The UDF is used to look up some dimensional data, for which it imports a (somewhat) large amount of text data.
In the Pig script, the UDF is used as follows:
DEFINE LookupInteger com.ourcompany.LookupInteger(<some parameters>);
The UDF stores the lookup data in a Map<Integer, Integer>.
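For illustration, here is a minimal sketch of how such a UDF is typically wired into a Pig script; the jar, S3 paths, relation names, and field names below are made up for the example, not taken from our actual script:

-- NOTE: all jar/S3 paths, relation names and fields here are hypothetical
REGISTER s3://our-bucket/udfs/lookup-udf.jar;
DEFINE LookupInteger com.ourcompany.LookupInteger('s3://our-bucket/dim/lookup.txt');

raw    = LOAD 's3://our-bucket/input/' USING PigStorage('\t') AS (id:int, payload:chararray);
-- the UDF is invoked once per record and resolves id against its in-memory Map<Integer, Integer>
looked = FOREACH raw GENERATE id, LookupInteger(id) AS dim_value, payload;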
On some input data, the aggregation fails with an exception like the following:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.split(String.java:2377)
at java.lang.String.split(String.java:2422)
[...]
at com.ourcompany.LocalFileUtil.toMap(LocalFileUtil.java:71)
at com.ourcompany.LookupInteger.exec(LookupInteger.java:46)
at com.ourcompany.LookupInteger.exec(LookupInteger.java:19)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextInteger(POUserFunc.java:379)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.genericGetNext(POBinCond.java:76)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POBinCond.getNextInteger(POBinCond.java:118)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
This does not occur when the Pig aggregation is run on MapReduce, so a workaround for us is to replace pig -x tez with pig -x mapreduce.
As I'm new to Amazon EMR, and to Pig on Tez, I'd appreciate some hints on how to analyse or debug the issue.
EDIT: It looks like strange runtime behaviour while running the Pig script on the Tez stack.
Please note that the Pig script is using a Map<Integer, Integer>, which produces the aforementioned OutOfMemoryError.

We found another workaround for the Tez backend: using increased values for mapreduce.map.memory.mb and mapreduce.map.java.opts (the latter set to 0.8 times mapreduce.map.memory.mb). Those values are bound to the EC2 instance types and are usually fixed values (see aws emr task config). By (temporarily) doubling them, we were able to make the Pig script succeed.
The following values were set for an m3.xlarge core instance, doubling that instance type's defaults (mapreduce.map.memory.mb=1440, mapreduce.map.java.opts=-Xmx1152m):
Pig startup command:
pig -Dmapreduce.map.java.opts=-Xmx2304m \
-Dmapreduce.map.memory.mb=2880 -stop_on_failure -x tez ... script.pig
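As an alternative to passing the properties on the command line, Pig's SET statement can set the same job properties from inside the script. We have only verified the -D variant above, so treat the following as a sketch under that assumption:

-- equivalent settings placed at the top of script.pig (assumption: SET behaves like -D here)
SET mapreduce.map.memory.mb 2880;
SET mapreduce.map.java.opts '-Xmx2304m';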
EDIT
One colleague came up with the following idea:
Another workaround for the OutOfMemoryError: GC overhead limit exceeded could be to add explicit STORE and LOAD statements for the problematic relations; that would make Tez flush the data to storage. This could also help in debugging the issue, as the (temporary, intermediate) data can be observed with other Pig scripts. See the sketch below.
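A rough sketch of that idea, assuming the problematic relation is called heavy_relation and using a made-up S3 path (both are illustrations, not names from our actual script):

-- NOTE: relation name, schema and S3 path are hypothetical
-- materialise the intermediate relation instead of keeping the whole pipeline in memory
STORE heavy_relation INTO 's3://our-bucket/tmp/heavy_relation' USING PigStorage('\t');

-- in a follow-up script (or a later step), read it back and continue the aggregation
heavy_relation = LOAD 's3://our-bucket/tmp/heavy_relation' USING PigStorage('\t') AS (id:int, dim_value:int);

The files under s3://our-bucket/tmp/heavy_relation can then also be inspected with a separate Pig script (or plain S3 tools) while debugging.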