
Find the top 5 most popular locations based on a sum in a Pig script


I'm trying to find the top 3 most popular locations with the greatest tripCount. So I need to see the total of tripCount per location and return the greatest n...

My data is as follows:

LocationID  tripCount  tripDistance
101            40           4.6
203            29           1.3
56             25           9.3
101            17           4.5
66             5            1.1
13             5            0.5
203            10           1.2
558            8            0.5
56             10           5.5 

So the result I'm expecting is:

 101     57 
 203     39
 56      35

So far my code is:

B = GROUP UNION_DATA BY DOLocationID;
C = FOREACH B {                          
DA = ORDER UNION_DATA BY passenger_count DESC;                
DB = LIMIT DA 5;                         
GENERATE FLATTEN(group), FLATTEN(DB.LocationID), FLATTEN(DB.dropoff_datetime);
}

What am I missing and what do I need to do to get the expected result?


Solution

  • The code below should give you the desired result. I broke the logic into small steps so it is easier to read and follow. The aliases and code in your question look incomplete, so I rewrote the script from scratch.

    Input data (columns: LocationID,tripCount,tripDistance):

    cat > trip_data.txt
    101,40,4.6
    203,29,1.3
    56,25,9.3
    101,17,4.5
    66,5,1.1
    13,5,0.5
    203,10,1.2
    558,8,0.5
    56,10,5.5
    

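    Pig code:

    -- Declare types so SUM returns a whole number instead of a double
    A = LOAD '/home/ec2-user/trip_data.txt' USING PigStorage(',')
        AS (LocationID:int, tripCount:int, tripDistance:double);
    DESCRIBE A;
    -- Collect all rows that share a LocationID
    B = GROUP A BY LocationID;
    DESCRIBE B;
    DUMP B;
    -- Sum tripCount per location
    C = FOREACH B GENERATE group AS LocationID, SUM(A.tripCount) AS totalTrips;
    DESCRIBE C;
    DUMP C;
    -- Sort by the summed count, largest first
    D = ORDER C BY totalTrips DESC;
    DESCRIBE D;
    DUMP D;
    -- Keep the top 3 locations
    RESULT = LIMIT D 3;
    DESCRIBE RESULT;
    DUMP RESULT;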
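
    As for what the attempt in the question was missing: an ORDER and LIMIT nested inside FOREACH sort and trim the rows within each group, so no per-location total is ever computed. Here is a minimal sketch of the same group, sum, order, limit steps using the question's own aliases. It assumes that UNION_DATA is already loaded, that DOLocationID is the location key, and that passenger_count is a numeric column playing the role of tripCount. The names come from the question's snippet; their types and meaning are assumptions.

    ```pig
    -- Assumption: UNION_DATA is already loaded, with DOLocationID as the key
    -- and a numeric passenger_count column to total up.
    B = GROUP UNION_DATA BY DOLocationID;
    -- One row per location: the group key plus the sum over that group's bag.
    C = FOREACH B GENERATE group AS DOLocationID,
                           SUM(UNION_DATA.passenger_count) AS total;
    -- Order across all locations (not inside each group), then keep the first 3.
    D = ORDER C BY total DESC;
    RESULT = LIMIT D 3;
    DUMP RESULT;
    ```

    Change the LIMIT value if you need a different number of top locations.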