Pig - Unable to evaluate Limit expression: NULL


I'm trying to dynamically limit the number of tuples in a bag inside a relation based on a column.

So, this is what I'm trying to do:

--tmp_data: {user_id: bytearray, book: chararray, hotness: double,cnt: long}
grp2 = GROUP tmp_data BY (user_id,cnt);

final_data = FOREACH grp2 {
 sorted = order tmp_data by user_id asc,hotness desc;
 top1 = LIMIT sorted cnt;
 GENERATE FLATTEN(top1);
};

The column "cnt" is a previously calculated count of books that I want to show to a user. So I group by user and count and I get a grouped relation with

grp2: {group: (user_id: bytearray,cnt: long),tmp_data: {(user_id: bytearray,book: chararray,hotness: double,cnt: long)}}

That way I can limit the number of books based on each user's count.

But for some reason, it's not working. It's giving me this weird error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias final_data. Backend error : org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [PORelationToExprProject (Name: RelationToExpressionProject[bag][*] - scope-19518 Operator Key: scope-19518) children: null at []]: java.lang.RuntimeException: Unable to evaluate Limit expression: NULL

If I use a constant, it works just fine, but it doesn't work the way I described above. I'm using Pig 0.11 and I read that we can use a constant in a LIMIT operation.

I also tried

top1 = LIMIT sorted (int)cnt;
top1 = LIMIT sorted tmp_data.cnt;
top1 = LIMIT sorted tmp_data::cnt;
--and with no sorting
top1 = LIMIT tmp_data cnt;

But nothing worked.

Please help. Thanks.


Solution

  • The Pig documentation states that the LIMIT expression cannot contain any columns from the input relation. It has to be built from constants or scalars. In your case, cnt is a column of the input relation, which is why the expression cannot be evaluated.
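
To make the distinction concrete, here is a minimal sketch of the two forms LIMIT accepts: a constant, and a scalar projected from a separate single-row relation. The aliases sorted_all, all_grp, total and n are hypothetical names chosen for illustration, and the sketch applies a single limit to the whole relation rather than a different limit per user.

```pig
-- tmp_data: {user_id: bytearray, book: chararray, hotness: double, cnt: long}
sorted_all = ORDER tmp_data BY hotness DESC;

-- Form 1: a constant limit.
top10 = LIMIT sorted_all 10;

-- Form 2: a scalar taken from a different, single-row relation.
-- total holds exactly one tuple, so total.n can be used as a scalar.
all_grp   = GROUP tmp_data ALL;
total     = FOREACH all_grp GENERATE COUNT(tmp_data) AS n;
top_tenth = LIMIT sorted_all total.n / 10;
```

Neither form reads a value from the tuples being limited, which is the difference from LIMIT sorted cnt in the question, where cnt comes from the input relation itself.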