Search code examples
hadoopapache-pig

Filter records in Pig


Below is the data

col1,col2,col3,col4,col5
------------------------
10,20,30,40,dollar
20,30,40,50,dollar
20,30,10,50,dollar
61,62,63,64,dollar
61,62,63,64,pound

col1,col2,col3 will form the combination of unique keys. The use case is to filter the data based on col5. For the unique key combination we need to filter the record where col5 value is "dollar", only if the same combination has "pound" value.

The expected output is

col1,col2,col3,col4,col5
------------------------
10,20,30,40,dollar
20,30,40,50,dollar
20,30,10,50,dollar
61,62,63,64,pound

How to proceed further since there is no special operators in Pig like Hive.

A = load 'test1.csv' using PigStorage(',') as (col1:int,col2:int,col3:int,col4:int,col5:chararray);
B = FILTER A BY col5 == 'pound';

Solution

  • Get all the records with 'pound', then get all records with 'dollar' that does not match with the id combination with 'pound' in col5. Finally, marry them off ... UNION.

    B = FILTER A BY col5 == 'pound';
    C = JOIN A BY (col1,col2,col3) LEFT OUTER,B BY (col1,col2,col3);
    D = FILTER C BY (B::col1 is null);
    E = FOREACH D GENERATE A::col1,A::col2,A::col3,A::col4,A::col5;
    F = UNION B,E;
    DUMP F;
    

    Output

    enter image description here