Tags: hadoop, apache-pig

How to store distinct values in a list for the same key using Pig


My input data looks like this:

col1|col2
a101|10
a101|20
a101|10
a101|30
a201|40
a201|50

Expected output:

a101|List<10,20,30>

a201|List<40,50>

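Below is my script, but it does not produce the expected output: each bag holds whole (col1, col2) tuples rather than just the distinct col2 values. I want to collect the distinct col2 values for each col1 into a list.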

input1 = LOAD 'list1.csv' USING PigStorage('|') AS (col1:chararray, col2:int);
input2 = DISTINCT (FOREACH input1 GENERATE col1, col2);
input3 = GROUP input2 BY col1;
DUMP input3;

Output:

(a101,{(a101,30),(a101,20),(a101,10)})
(a201,{(a201,50),(a201,40)})

Solution

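  • Try this. After grouping, project only col2 out of each grouped bag (input2.col2), so that every key is paired with a bag of its col2 values instead of the full tuples: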

    input1 = LOAD 'input.txt' USING PigStorage('|') AS (col1:chararray, col2:int);
    -- DISTINCT is what drops the duplicate (a101,10) row; without it, 10 would appear twice in the bag
    input2 = DISTINCT input1;
    input3 = GROUP input2 BY col1;
    -- group is the col1 key; input2.col2 keeps only the col2 field of each tuple in the bag
    data = FOREACH input3 GENERATE FLATTEN(group) AS col1, input2.col2 AS col2;
    DUMP data;
    

    Output Generated:

    (a101,{(30),(20),(10)})
    (a201,{(50),(40)})
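
    If the values are wanted in a fixed order, or as a single delimited string closer to the List<10,20,30> form in the question, one possible variant is to de-duplicate inside a nested FOREACH and join the bag with the built-in BagToString. This is only a sketch: the relation names (rows, grouped, result) and the output field name col2_list are made up for illustration, it assumes the file has no header line, and it assumes a Pig version that ships BagToString.

    ```pig
    -- Sketch only: rows, grouped, result and col2_list are illustrative names.
    rows    = LOAD 'list1.csv' USING PigStorage('|') AS (col1:chararray, col2:int);
    grouped = GROUP rows BY col1;

    result = FOREACH grouped {
        vals   = rows.col2;            -- keep only the col2 field of each tuple
        uniq   = DISTINCT vals;        -- drop repeated values within this key
        sorted = ORDER uniq BY col2;   -- optional: ascending order
        -- join the bag into one comma-separated chararray
        GENERATE group AS col1, BagToString(sorted, ',') AS col2_list;
    };

    DUMP result;
    ```

    With this approach the DISTINCT runs per group rather than over the whole relation, so no separate de-duplication step is needed before the GROUP. If a bag is preferred over a string, GENERATE group AS col1, sorted AS col2 should work in the same place.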