I've been stuck on a problem for a long time, and any help would be appreciated. I have a dataset file in the /home/hadoop/pig directory. I can view the file, so it is not a permissions issue. The dataset has 4 columns separated by "::" as the delimiter. I'm running Pig in local mode from inside the /home/hadoop/pig directory.
ratingsData = LOAD 'ratings.dat' AS (line:chararray);
ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);
grouped_mid = GROUP ratings BY mid;
dump grouped_mid;
The above script fails. I can successfully dump the 'ratingsData' and 'ratings' relations, but not grouped_mid. Here's the bizarre part: the script below runs successfully.
ratingsData = LOAD 'ratings.dat' AS (line:chararray);
ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);
STORE ratings INTO 'ratingInfo.txt';
X = LOAD 'ratingInfo.txt' AS (uid:int, mid:int, rating:int, timestamp:long);
grouped_mid = GROUP X BY mid;
dump grouped_mid;
Obviously, the second script has a redundant step: I'm simply storing a relation and loading it again, which I want to avoid. Any clarification or explanation would be greatly appreciated.
Thanks much.
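See this related question: "pig join with java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer". The likely cause is the same here: REGEX_EXTRACT_ALL returns a tuple of strings, and the AS clause alone does not seem to convert them to int and long, so the GROUP on mid fails on the type mismatch. Storing and reloading works because the loader does the conversion.
You can modify your script to cast the extracted tuple explicitly: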
ratingsData = LOAD 'ratings.dat' AS (line:chararray);
ratings = FOREACH ratingsData GENERATE FLATTEN((tuple(int, int, int, long))REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);
grouped_mid = GROUP ratings BY mid;
dump grouped_mid;
Tested.
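If you want to check what Pig believes the schema is before grouping, DESCRIBE can be run on the relation. This is a minimal sketch, assuming the same ratings.dat file and field names as in the question. Note that DESCRIBE reports the declared schema, so it can list int and long fields even when the underlying values are still strings, which would be consistent with the cast error in the linked question.

```
-- Same load and extraction as in the question, without the tuple cast
ratingsData = LOAD 'ratings.dat' AS (line:chararray);
ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);
-- Print the schema Pig has recorded for the relation (declared types only)
DESCRIBE ratings;
```

Running the same DESCRIBE against the version with the explicit tuple cast lets you compare the two declared schemas side by side.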