How to use 2 for loops in Apache Pig


How do I use 2 for loops in Apache Pig?

I have input data as below:

1  a 3
15 b 4
1  b 2
25 a 5
15 c 3
1  a 3
15 c 2
25 b 4

Intermediate output: for each index (1, 15, 25), sum the counts per id. For example, for index 1, total the counts for a and for b:

1 a 6
1 b 2
15 b 4
15 c 5
25 a 5
25 b 4

Final output: for each index, keep only the id with the largest total:

1 a 6
15 c 5
25 a 5

Solution

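    -- Pig has no nested loops; two GROUP/FOREACH passes do the same job.
    -- PigStorage() with no argument expects tab-separated fields.
    A = LOAD 'test.input' USING PigStorage() AS (index:int, id:chararray, count:int);

    -- Pass 1: total the counts for each (index, id) pair.
    B = GROUP A BY (index, id);
    C = FOREACH B GENERATE FLATTEN(group), SUM(A.count) AS sum;

    STORE C INTO '/tmp/intermediate';

    -- Pass 2: within each index, keep only the row with the largest total.
    D = GROUP C BY index;

    E = FOREACH D {
        ORDERED_C = ORDER C BY sum DESC;
        LIMIT_C = LIMIT ORDERED_C 1;
        GENERATE FLATTEN(LIMIT_C);  -- flatten to take out the unnecessary bag
    }
    STORE E INTO '/tmp/final';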
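
If the nested ORDER/LIMIT block feels heavy, the second pass can probably be written with Pig's built-in TOP function instead. This is only a sketch of that variant, not part of the original answer: it assumes the same tab-separated 'test.input' file, the alias name total is my own choice, and DUMP is used instead of STORE just to print the result. On the sample data it should give the same three rows as the final output in the question, though the order of the rows is not guaranteed, and on a tie for the maximum only one of the tied rows is kept.

```pig
-- Variant of the script above: TOP replaces the nested ORDER ... LIMIT block.
-- Assumes 'test.input' is tab-separated (the PigStorage() default).
A = LOAD 'test.input' USING PigStorage() AS (index:int, id:chararray, count:int);

-- Pass 1: total per (index, id); "total" is an illustrative alias name.
B = GROUP A BY (index, id);
C = FOREACH B GENERATE FLATTEN(group) AS (index, id), SUM(A.count) AS total;

-- Pass 2: TOP(n, column, bag) returns the n tuples with the largest value
-- in the given column; the column is a 0-based position, so 2 means total.
D = GROUP C BY index;
E = FOREACH D GENERATE FLATTEN(TOP(1, 2, C));

DUMP E;
```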