apache-pig

How to find out how many tuples there are in each GROUP


This is my input:

10001 AMERICAN EXPRESS,TX, Y
10001 BOFA,IL,N
10001 CHASE,NJ,Y
10002 CHASE,IL,Y
10002 BOFA,TX,Y
10002 AMERICAN EXPRESS,NJ,Y

I have to group my data by the key. Intermediate output:

10001, {(AMERICAN EXPRESS,TX,Y),(BOFA,IL,N),(CHASE,NJ,Y)}
10002, {(CHASE,IL,Y),(BOFA,TX,Y)}

Then I have to find out how many tuples there are in each group, keeping only the groups that have more than one tuple.

10001, count(tuples) > 1, count = 3
10002, count(tuples) > 1, count = 2

Can someone please help me out?


Solution

  • After grouping, use COUNT on the second field to get the number of tuples in each group, then use FILTER to keep only the groups with a count > 1.

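    -- Assumes each line is comma-delimited: key,name,state,flag
    A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:chararray,f3:chararray,f4:chararray);
    -- Each group holds its key plus a bag named A with the matching tuples
    B = GROUP A BY f1;
    -- Inside the bag, fields are referenced as A.f2 (a bare f2 is not in B's schema)
    C = FOREACH B GENERATE group,COUNT(A.f2) AS Total;
    -- Keep only the groups that have more than one tuple
    D = FILTER C BY (Total > 1);
    DUMP D;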
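
A possible variant, offered as a sketch rather than a definitive script: COUNT skips tuples whose first field is null, so if f2 can be empty, COUNT_STAR on the whole bag counts every tuple instead. The filter can also be applied directly to the grouped relation when the tuples themselves are still needed afterwards. The file name data.txt and the field names f1 to f4 follow the solution above; the comma-delimited layout of the input is an assumption.

```pig
-- Assumption: input is comma-delimited as key,name,state,flag
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int, f2:chararray, f3:chararray, f4:chararray);
B = GROUP A BY f1;

-- COUNT_STAR counts every tuple in the bag, including tuples with null fields
C = FOREACH B GENERATE group AS f1, COUNT_STAR(A) AS Total;
D = FILTER C BY Total > 1;
DUMP D;

-- Alternative: keep the grouped tuples and filter on the bag size directly
E = FILTER B BY COUNT_STAR(A) > 1;
DUMP E;
```

D holds one (key, count) pair per qualifying group, while E keeps each qualifying key together with its bag of tuples.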