Pig: using SPLIT on a GROUP BY with a HAVING COUNT(*) condition


Input file is:

2, cornflakes, Regular,General Mills, 12
3, cornflakes, Mixed Nuts, Post, 14
4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

filter3 = LOAD 'location_of_file' using PigStorage('\t') as (item_sl : int, item : chararray, type: chararray, manufacturer: chararray, price : int);

SPLIT filter3 INTO filter4 IF (FOREACH (filter3 GROUP BY item) GENERATE group, COUNT(item < 3)), filter6_pass OTHERWISE;

This is meant to behave like a SQL query with GROUP BY item HAVING COUNT(*) < 3.

The desired output is:

4, chocolate syrup, Regular, Hersheys, 5
5, chocolate syrup, No High Fructose, Hersheys, 8
6, chocolate syrup, Regular, Ghirardeli, 6
7, chocolate syrup, Strawberry Flavor, Ghirardeli, 7

Solution

  • Group by item, compute the count per group, filter on that count, and then join the result back to the original relation to recover the full rows.

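    -- Load the data; PigStorage('\t') assumes a tab-delimited file (use PigStorage(',') for comma-separated input)
    A = LOAD 'location_of_file' USING PigStorage('\t') AS (item_sl:int, item:chararray, type:chararray, manufacturer:chararray, price:int);
    -- Count the rows for each item
    B = GROUP A BY item;
    C = FOREACH B GENERATE group, COUNT(A.item) AS Total;
    -- Keep only the items that occur more than 3 times
    D = FILTER C BY Total > 3;
    -- Join back to A and keep only the original five columns
    E = JOIN A BY item, D BY $0;
    F = FOREACH E GENERATE $0..$4;
    DUMP F;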
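
Since the question asks about SPLIT specifically, here is a rough sketch of a variant that keeps the SPLIT and avoids the join. SPLIT conditions are evaluated per row, so the count has to be attached to each row first. The aliases by_item, with_count, and rows_out are illustrative names, not from the original post; the threshold comes from the question's "count(*) < 3"; and the sketch assumes the file parses into the five declared fields.

```pig
-- Sketch only: by_item, with_count and rows_out are illustrative aliases.
A = LOAD 'location_of_file' USING PigStorage('\t') AS (item_sl:int, item:chararray, type:chararray, manufacturer:chararray, price:int);

-- Group the rows by item; each group carries a bag (named A) of its rows.
by_item = GROUP A BY item;

-- Flatten the bag back into rows and attach the group's row count to every row.
with_count = FOREACH by_item GENERATE FLATTEN(A), COUNT(A) AS total;

-- Route each row by its item's count: fewer than 3 goes to filter4, the rest to filter6_pass.
SPLIT with_count INTO filter4 IF total < 3, filter6_pass OTHERWISE;

-- Drop the helper count column, keeping the five original fields.
rows_out = FOREACH filter6_pass GENERATE $0..$4;
DUMP rows_out;
```

With the sample rows in the question, cornflakes appears twice and chocolate syrup four times, so the cornflakes rows should land in filter4 and the chocolate syrup rows in filter6_pass. The row order of the DUMP output is not guaranteed.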