Populate max value to adjacent record for the same key using Pig


I have the data set below:

key,value
---------
key1|10
key1|20
key1|30
key2|50
key2|70

I need to add a new column that holds the maximum of the "value" column for each key.

The output must be:

key1|10|30
key1|20|30
key1|30|30
key2|50|70
key2|70|70

Below is my Pig script, but it fails with an error.
A = LOAD 'input.txt' using PigStorage('|');
B = foreach A generate $0,$1,max($1);


grunt> A = LOAD 'input.txt' using PigStorage('|');
grunt> B = foreach A generate $0,$1,max($1);

2017-05-26 06:48:02,347 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve max using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]

Solution

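  • ERROR 1070 occurs because Pig function names are case-sensitive: the built-in is MAX, not max. Fixing the case alone is not enough, though, because aggregate functions such as MAX, MIN, and AVG operate on a bag, so you need to group the relation first. The following script should do it.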

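    -- Load the pipe-delimited input with an explicit schema.
    A = LOAD 'input.txt' USING PigStorage('|') AS (id: chararray, val: int);
    -- Group by key so that MAX can aggregate over each key's bag.
    B = GROUP A BY id;
    C = FOREACH B GENERATE group AS id, MAX(A.val) AS maxval;
    -- Join the per-key maximum back onto every original row.
    D = JOIN A BY id, C BY id;
    E = FOREACH D GENERATE A::id, A::val, C::maxval;
    DUMP E;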
    

    Running this should produce the following (row order may vary):

    (key1,30,30)
    (key1,20,30)
    (key1,10,30)
    (key2,70,70)
    (key2,50,70)
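
    The same result can likely be had without the JOIN. The sketch below assumes the input is the pipe-delimited 'input.txt' from the question and uses the field names id, val, and maxval purely for illustration. It relies on FLATTEN expanding the grouped bag back into one row per original record, with the aggregate repeated on each row:

    -- Load the pipe-delimited input (file name and schema are assumptions).
    A = LOAD 'input.txt' USING PigStorage('|') AS (id: chararray, val: int);
    -- Collect all records for each key into a bag named A.
    B = GROUP A BY id;
    -- FLATTEN(A) emits one row per record in the bag;
    -- MAX(A.val) is computed once per group and attached to every row.
    C = FOREACH B GENERATE FLATTEN(A), MAX(A.val) AS maxval;
    DUMP C;

    Each output tuple has the form (id, val, maxval), for example (key1,10,30). Since the grouped bag already holds every record for the key, this avoids the second pass over the data that the JOIN needs, though the row order is still not guaranteed.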