How to group by key and value using Pig


I am using Pig, and this is part of the text I want to analyse:

SciTePress: 32    
Springer: 10    
Springer: 13    
Springer: 14    
Springer: 1571

What I am trying to achieve is to sum the counts for each title and then sort the totals. For instance, I want the output to look like this:

Springer: 1608  // (i.e. the sum of 10 + 13 + 14 + 1571)
SciTePress: 32

Is there a way to achieve this using Pig?

This is the output I am getting now:

Springer: 1571
SciTePress: 32  
Springer: 14  
Springer: 13    
Springer: 10  

These are the commands I have used:

    WORDS = LOAD '../filename' using PigStorage(':') AS (title: chararray, count:int);
    grpd = GROUP WORDS BY count;
    sorted = order WORDS by count desc;
    top5 = limit sorted 5;
    dump top5;

Solution

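  • The script in the question groups by count and then sorts the original, ungrouped relation, so nothing is ever summed. Group the data by title instead, then call the SUM function on each group to get the total per title.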

    Input :

    SciTePress: 32    
    Springer: 10    
    Springer: 13    
    Springer: 14    
    Springer: 1571
    

    Pig Script :

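    -- Split each line on ':' into a title and an integer count.
    words = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(':') AS (title:chararray, title_count:int);
    -- Collect all rows that share a title into one group.
    grp_by_title = GROUP words BY title;
    -- Emit one row per title with the sum of its counts.
    req_data = FOREACH grp_by_title GENERATE group AS title, SUM(words.title_count) AS total_count;
    -- Sort by the summed count (ORDER is ascending unless DESC is given).
    req_data_ordered = ORDER req_data BY total_count;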
    

    Output (DUMP req_data_ordered;) :

    (SciTePress,32)
    (Springer,1608)
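
The expected output in the question lists the largest total first, while ORDER without a direction sorts ascending. A minimal sketch of a descending variant follows, assuming the same input file and schema as the script in this answer. The relation names req_data_desc and top5 are illustrative, and the LIMIT of 5 is carried over from the question's own script rather than being required.

```pig
-- Same load, group and sum steps as in the answer's script.
words = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(':') AS (title:chararray, title_count:int);
grp_by_title = GROUP words BY title;
req_data = FOREACH grp_by_title GENERATE group AS title, SUM(words.title_count) AS total_count;

-- DESC puts the largest total first (assumption: this is the order the asker wants).
req_data_desc = ORDER req_data BY total_count DESC;

-- Keep at most five titles, mirroring the "top5" step in the question.
top5 = LIMIT req_data_desc 5;
DUMP top5;
```

With the sample input, this should put the Springer total ahead of the SciTePress row, which matches the order shown in the question.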