I load the following three tables from Hive:
books = LOAD 'books' USING org.apache.hive.hcatalog.pig.HCatLoader() AS (isbn_b: chararray, booktitle: chararray, author: chararray, pubyear: chararray, publisher: chararray, urls: chararray, urlm: chararray, urll: chararray);
users = LOAD 'users' USING org.apache.hive.hcatalog.pig.HCatLoader() AS (id_u: chararray, location: chararray, age: chararray);
ratings = LOAD 'ratings' USING org.apache.hive.hcatalog.pig.HCatLoader() AS (id_r: chararray, isbn_r: chararray, rating: chararray);
and I join and group them as follows:
OnISBN = JOIN ratings BY isbn_r, books BY isbn_b;
total = JOIN OnISBN BY id_r, users BY id_u;
loc_group = GROUP total BY location;
When I run the command:
final = FOREACH loc_group GENERATE
group as location,
COUNT(total) as rec_num,
SUM(total.rating) as book_rating_sum;
I get a Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
error. I think this is because in the SUM
statement I am not referring correctly to the rating
but this is because I am a novice in Pig. What I would like as output is something in the following format:
(location, counts, sum of ratings score over that location)
I know this is something really small but I have been fighting with this for some time and I am stuck. I would appreciate some help.
Your ratings are chararrays, but SUM
needs numeric input. Either read it as a numeric type in your LOAD
statement, e.g. rating: float
, or cast it in your summation, e.g. SUM((float)total.rating)
.