
How to get DISTINCT values of a group of fields in PIG?


Is it possible to get the following output in Pig? Can I GROUP BY the 1st and 2nd fields and then apply DISTINCT to the 3rd field?

For example, given this input data:

    12345|9658965|52145
    12345|9658965|52145
    12345|9658965|52145
    23456|8541232|96589
    23456|8541232|96585

I want output like this:

    12345|9658965|52145
    23456|8541232|96589
    23456|8541232|96585

Solution

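  • Approach 1: Using DISTINCT

    Ref: http://pig.apache.org/docs/r0.12.0/basic.html#distinct

    The DISTINCT operator removes duplicate tuples from a relation, so applying it to the whole relation keeps one copy of each unique row: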

    test = LOAD 'test.csv' USING PigStorage('|');
    distinct_recs = DISTINCT test;
    DUMP distinct_recs;
    

  • Approach 2: GROUP BY all fields

    test = LOAD 'test.csv' USING PigStorage('|');
    grp_all_fields = GROUP test BY ($0,$1,$2);
    uniq_recs = FOREACH grp_all_fields GENERATE FLATTEN(group);
    DUMP uniq_recs;
    

    Both approaches give the expected output for the sample input. Pig does not guarantee the order of the resulting rows unless you add an ORDER BY.
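
    The question also asks whether you can group by the first two fields and then apply DISTINCT to the third. A minimal sketch of that variant, using a nested FOREACH, might look like the following. The field names f1, f2, f3, the chararray types, and the output path 'distinct_out' are assumptions for illustration and are not part of the original data set.

    ```pig
    -- Assumed schema: three pipe-delimited fields, named f1..f3 for illustration only
    test = LOAD 'test.csv' USING PigStorage('|') AS (f1:chararray, f2:chararray, f3:chararray);

    -- Group on the first two fields; each group holds a bag of the matching rows
    grp = GROUP test BY (f1, f2);

    -- Inside each group, project the third field and de-duplicate it
    uniq = FOREACH grp {
        f3_vals = test.f3;
        uniq_f3 = DISTINCT f3_vals;
        -- Flatten the group key and the distinct bag back into flat rows
        GENERATE FLATTEN(group), FLATTEN(uniq_f3);
    };

    DUMP uniq;

    -- DUMP prints tuples as (a,b,c); to write pipe-delimited rows instead,
    -- store the relation (the output path here is hypothetical)
    STORE uniq INTO 'distinct_out' USING PigStorage('|');
    ```

    For the sample input this should yield the same three unique rows as the two approaches above, in no guaranteed order. It is mainly worth the extra code when you want distinct values of one field per group rather than distinct whole rows.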