Search code examples
apache-pig

Apache Pig, adding values for an id from different columns


I have loaded 3 tables for 3 subjects in PIG, each having Name:chararray and score:float. The same name may not occur necessarily in all subjects.

I need to add the scores from the 3 tables into a single one, having Name and Total score.

I used to do this in SQL using nested queries. How to do this in PIG? I tried using full outer join, but was stuck after encountering null values in the Name columns for the subjects in which the name doesn't exist.


Solution

  • From the description of your problem, a simple UNION of the 3 files, followed by a GROUP BY should produce the results you're looking for.


    Example Code:

    union_example.pig

    data_1 = LOAD 'union1.csv' USING PigStorage(',') AS (name:chararray,score:float);
    data_2 = LOAD 'union2.csv' USING PigStorage(',') AS (name:chararray,score:float);
    data_3 = LOAD 'union3.csv' USING PigStorage(',') AS (name:chararray,score:float);
    
    data = UNION data_1,data_2,data_3;
    
    data_grp = GROUP data BY name;
    
    data_gen = FOREACH data_grp GENERATE group, SUM(data.score);
    
    dump data_gen;
    




    Datasets:

    union1.csv

    bob,3
    elvis,4
    jim,4
    dave,2
    sneech,4
    suess,3
    giri,5
    



    union2.csv

    mike,2
    rick,3
    jim,3
    giri,4
    dave,3
    elvis,5
    



    union3.csv

    bob,5
    bing,4
    suess,4
    sneech,5
    dave,4
    jim,2
    giri,2
    




    Results:

    (bob,8.0)
    (jim,9.0)
    (bing,4.0)
    (dave,9.0)
    (giri,11.0)
    (mike,2.0)
    (rick,3.0)
    (elvis,9.0)
    (suess,7.0)
    (sneech,9.0)