
Apache Pig: java.lang.OutOfMemoryError: Java heap space


I am trying to join two Pig relations.

RELATION1 = LOAD '$path' USING AvroStorage();
RELATION2 = LOAD '$path' USING AvroStorage();
RELATION3 = JOIN RELATION1 BY field, RELATION2 BY field;
STORE RELATION3 INTO '$PATH' USING AvroStorage();

But I am getting the following error:

java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
Caused by: java.lang.reflect.InvocationTargetException
Caused by: java.lang.OutOfMemoryError: Java heap space

It seems to be complaining that there is not enough heap space. In my case, RELATION1 is relatively large (about 1000 GB) and RELATION2 is small. Simply loading RELATION1 in a Pig script and applying a filter works. Can someone suggest how I can get around this problem? Thanks!


Solution

  • Since one of your relations is much smaller than the other, you can optimize the join itself. In a regular join, Pig streams the last relation instead of holding it in memory, so the smaller relation should go first and the larger one last (read more here):

    RELATION3 = JOIN RELATION2 BY field, RELATION1 BY field;
    

    If one of your relations is small enough to fit into memory, you can do a replicated join (read more here). Note that the order is the reverse of the above, because here the relation listed last is the one loaded into memory:

    RELATION3 = JOIN RELATION1 BY field, RELATION2 BY field USING 'replicated';
    

    Additionally, you can use FOREACH statements before the join to keep only the fields you need, so that less data has to be moved around. Also, do any filtering before the join.
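
    As a rough sketch of that advice, the snippet below filters and projects both relations before joining them. The field `other_field` and the `IS NOT NULL` condition are hypothetical placeholders, not taken from the question; substitute whatever fields and predicates your data actually needs.

    ```pig
    -- Hypothetical: 'other_field' and the filter condition are placeholders.
    -- Drop unwanted rows first, then keep only the fields needed downstream.
    R1_FILTERED = FILTER RELATION1 BY other_field IS NOT NULL;
    R1_SLIM     = FOREACH R1_FILTERED GENERATE field, other_field;
    R2_SLIM     = FOREACH RELATION2 GENERATE field;

    -- Large relation first, small (in-memory) relation last for a replicated join.
    RELATION3 = JOIN R1_SLIM BY field, R2_SLIM BY field USING 'replicated';
    ```

    Trimming RELATION2 matters most for the replicated join, since that is the relation that has to fit in memory.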

    If you still get Java memory errors after these modifications, you can change the MapReduce memory settings. For example, this other Stack Overflow answer recommends:

    SET mapreduce.map.memory.mb 4096;
    SET mapreduce.reduce.memory.mb 6144;
    

    (Searching for the error message turns up many other questions and answers with different recommended settings that you can try.)
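
    One caveat: the `memory.mb` properties above set the size of the YARN container granted to each task, while the JVM heap is controlled separately through `mapreduce.map.java.opts` and `mapreduce.reduce.java.opts`. Depending on your Hadoop version and cluster defaults, you may need to raise the heap as well. The values below are an assumption (roughly 80% of the container sizes above, a common rule of thumb) and are not from the original answer; tune them for your cluster.

    ```pig
    -- Container sizes, as in the settings quoted above.
    SET mapreduce.map.memory.mb 4096;
    SET mapreduce.reduce.memory.mb 6144;

    -- Assumed heap sizes (~80% of each container); adjust for your cluster.
    SET mapreduce.map.java.opts '-Xmx3276m';
    SET mapreduce.reduce.java.opts '-Xmx4915m';
    ```

    Keeping the heap below the container size leaves headroom for non-heap JVM memory, so YARN is less likely to kill the task for exceeding its limit.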