
How to group results by region with Pig Latin?


I'm new to Apache Pig on Hadoop, and I have a dataset that looks like this:

region_id        region         participation   score

    1             SSA               YES          10
    1             SSA               NO           22
    2             MONTPELIER        YES          15
    ....

I want to calculate the sum of scores for each region. The final output I'm looking for is
REGION - SCORE, for example:

SSA - 32

I loaded my data:

data = load '/user/cloudera/datapi/pigdata.csv' using PigStorage (',') AS
 (id:int, region:chararray, participation:chararray, score:int);

Then grouped the data by region:

split_region = GROUP data by region;

Finally:

RES= foreach split_region GENERATE SUM(data.score), data.region;

The RES relation contains the sum of scores for each region, but it displays every occurrence of the region, like so:

(32 , {SSA,SSA})

What is the missing command/instruction to display (32, SSA) instead?


Solution

  • Use group instead of data.region

    RES = foreach split_region GENERATE SUM(data.score), group;
    

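    As the Pig documentation on GROUP explains, the first field of each output tuple is named "group" (do not confuse this with the GROUP operator) and has the same type as the group key. The second field takes the name of the original relation (here, data) and is a bag holding every tuple that shares that key. This is why data.region returns the region once per input row, while group returns it once per region.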
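
For reference, here is a minimal sketch of the whole script, using the path and schema from the question. The aliases region, total, and formatted are names chosen here for illustration; they are not part of the original post. The last step assumes you want the literal "REGION - SCORE" text, and it nests two-argument CONCAT calls so that it should also work on older Pig releases.

```pig
-- Load the CSV with the schema given in the question.
data = LOAD '/user/cloudera/datapi/pigdata.csv' USING PigStorage(',')
       AS (id:int, region:chararray, participation:chararray, score:int);

-- Group by region; each output tuple has two fields: (group, data).
split_region = GROUP data BY region;

-- Optional: print the schema to see the "group" field and the "data" bag.
DESCRIBE split_region;

-- One row per region: the group key plus the sum over the bag's scores.
-- "region" and "total" are illustrative aliases.
RES = FOREACH split_region GENERATE group AS region, SUM(data.score) AS total;

-- Build the "REGION - SCORE" string; the sum is cast to chararray first.
formatted = FOREACH RES GENERATE
            CONCAT(region, CONCAT(' - ', (chararray)total));

DUMP formatted;
```

With the two SSA rows shown in the question (scores 10 and 22), the SSA line of the result should read SSA - 32, which DUMP prints wrapped in parentheses as a one-field tuple.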