I have loaded 3 tables for 3 subjects in PIG, each having Name:chararray and score:float. The same name may not occur necessarily in all subjects.
I need to add the scores from the 3 tables into a single one, having Name and Total score.
I used to do this in SQL using nested queries. How to do this in PIG? I tried using full outer join, but was stuck after encountering null values in the Name columns for the subjects in which the name doesn't exist.
From the description of your problem, a simple UNION
of the 3 files, followed by a GROUP BY
should produce the results you're looking for.
data_1 = LOAD 'union1.csv' USING PigStorage(',') AS (name:chararray,score:float);
data_2 = LOAD 'union2.csv' USING PigStorage(',') AS (name:chararray,score:float);
data_3 = LOAD 'union3.csv' USING PigStorage(',') AS (name:chararray,score:float);
data = UNION data_1,data_2,data_3;
data_grp = GROUP data BY name;
data_gen = FOREACH data_grp GENERATE group, SUM(data.score);
dump data_gen;
bob,3
elvis,4
jim,4
dave,2
sneech,4
suess,3
giri,5
union2.csv
mike,2
rick,3
jim,3
giri,4
dave,3
elvis,5
union3.csv
bob,5
bing,4
suess,4
sneech,5
dave,4
jim,2
giri,2
(bob,8.0)
(jim,9.0)
(bing,4.0)
(dave,9.0)
(giri,11.0)
(mike,2.0)
(rick,3.0)
(elvis,9.0)
(suess,7.0)
(sneech,9.0)