Search code examples
sqlhadoopjoinapache-pig

Computing the sum in a joined table in Apache Pig


I load the following three tables from Hive:

books   = LOAD 'books'   USING org.apache.hive.hcatalog.pig.HCatLoader() AS (isbn_b: chararray, booktitle: chararray, author: chararray, pubyear: chararray, publisher: chararray, urls: chararray, urlm: chararray, urll: chararray);
users   = LOAD 'users'   USING org.apache.hive.hcatalog.pig.HCatLoader() AS (id_u: chararray, location: chararray, age: chararray);
ratings = LOAD 'ratings' USING org.apache.hive.hcatalog.pig.HCatLoader() AS (id_r: chararray, isbn_r: chararray, rating: chararray);

and I join and group them as follows:

OnISBN    = JOIN  ratings BY isbn_r, books BY isbn_b;
total     = JOIN  OnISBN  BY id_r, users BY id_u;
loc_group = GROUP total   BY location;

When I run the command:

final = FOREACH loc_group GENERATE
            group as location,
            COUNT(total) as rec_num,
            SUM(total.rating) as book_rating_sum;

I get a Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast. error. I think this is because in the SUM statement I am not referring correctly to the rating but this is because I am a novice in Pig. What I would like as output is something in the following format:

(location, counts, sum of ratings score over that location)

I know this is something really small but I have been fighting with this for some time and I am stuck. I would appreciate some help.


Solution

  • Your ratings are chararrays, but SUM needs numeric input. Either read it as a numeric type in your LOAD statement, e.g. rating: float, or cast it in your summation, e.g. SUM((float)total.rating).