I'm new to Apache Pig on Hadoop, and I have a dataset that looks like this:
region_id region participation score
1 SSA YES 10
1 SSA NO 22
2 MONTPELIER YES 15
....
I want to calculate the sum of scores for each region. The final display that I'm looking for is:
REGION - SCORE, for example:
SSA - 32
I loaded my data:
data = load '/user/cloudera/datapi/pigdata.csv' using PigStorage (',') AS
(id:int, region:chararray, participation:chararray, score:int);
Then grouped the data by region:
split_region = GROUP data by region;
Finally:
RES= foreach split_region GENERATE SUM(data.score), data.region;
The RES relation contains the sum of scores for each region, but it displays every occurrence of the region, like so:
(32 , {SSA,SSA})
What is the missing command/instruction to display (32, SSA) instead?
Use group instead of data.region:
RES = foreach split_region GENERATE SUM(data.score), group;
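See here for source. When you use the GROUP operator, the first field of each output tuple is named "group" (do not confuse this with the GROUP operator) and has the same type as the group key. The second field takes the name of the original relation (here, data) and is a bag holding every tuple in that group, which is why data.region returns a bag like {SSA,SSA} rather than a single value.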
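Putting it together, here is a minimal sketch of the whole script, reusing the path and schema from the question. The output aliases region and total_score are names chosen for illustration, not anything Pig requires, and the region is listed first to match the REGION - SCORE order asked for.

```pig
-- Load the CSV using the schema given in the question.
data = LOAD '/user/cloudera/datapi/pigdata.csv' USING PigStorage(',')
       AS (id:int, region:chararray, participation:chararray, score:int);

-- One output tuple per distinct region: (group, data),
-- where data is a bag of all the input tuples for that region.
split_region = GROUP data BY region;

-- Optional: print the schema to see the "group" field and the "data" bag.
DESCRIBE split_region;

-- "group" holds the region key once per group; SUM aggregates over the bag.
-- The aliases region and total_score are illustrative names only.
RES = FOREACH split_region GENERATE group AS region, SUM(data.score) AS total_score;

DUMP RES;
```

With the sample rows shown in the question, the SSA group should come out as (SSA,32). The order of the groups in the DUMP output is not guaranteed unless you add an ORDER step.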