hadoop, apache-pig, bigdata

Unable to dump a relation in PIG


I've been stuck on this problem for a long time, and any help would be appreciated. I have a dataset file in the /home/hadoop/pig directory. I can view the file, so there is no permissions issue. The dataset has 4 columns separated by "::" as the delimiter. I'm running Pig in local mode from inside the /home/hadoop/pig directory.
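
For reference, here are a couple of hypothetical sample lines (values made up, assuming the uid::mid::rating::timestamp order declared in the script below):

1::1193::5::978300760
2::661::3::978302109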

ratingsData = LOAD 'ratings.dat' AS (line:chararray);

ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);

grouped_mid = GROUP ratings BY mid;

dump grouped_mid;

The above script fails. I can successfully dump the 'ratingsData' and 'ratings' relations, but not grouped_mid. But here's the bizarre part: the script below runs successfully.

ratingsData = LOAD 'ratings.dat' AS (line:chararray);

ratings = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);

STORE ratings INTO 'ratingInfo.txt';

X = LOAD 'ratingInfo.txt' AS (uid:int, mid:int, rating:int, timestamp:long);

grouped_mid = GROUP X BY mid;

dump grouped_mid;

Obviously, the second script has a redundant step: I'm simply storing a relation and reloading it again. I want to avoid this. Any clarification/explanation would be highly appreciated.

Thanks much.


Solution

  • Refer to this: pig join with java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

    REGEX_EXTRACT_ALL returns all of its captured groups as chararrays; the AS clause only declares a schema, it does not convert the values, so the GROUP on mid later fails when Pig tries to treat those strings as integers. (That is also why the STORE/re-LOAD version works: the loader applies the declared types when it reads the data back in.) Casting the whole output tuple fixes the types in place. You can modify your script to:

    ratingsData = LOAD 'ratings.dat' AS (line:chararray);
    
    ratings = FOREACH ratingsData GENERATE FLATTEN((tuple(int, int, int, long))REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:int, mid:int, rating:int, timestamp:long);
    
    grouped_mid = GROUP ratings BY mid;
    
    dump grouped_mid;
    

    Tested.
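
    If you'd rather not cast the whole tuple, an equivalent sketch (untested here, assuming the same ratings.dat layout and field names as above) is to declare the extracted fields as chararray, which is what REGEX_EXTRACT_ALL actually produces, and then cast them field by field in a second FOREACH:

    -- extract the four captured groups as plain strings first
    ratingsData = LOAD 'ratings.dat' AS (line:chararray);

    raw = FOREACH ratingsData GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*?)::(.*?)::(.*?)::(.*?)')) AS (uid:chararray, mid:chararray, rating:chararray, timestamp:chararray);

    -- cast each field explicitly so the runtime types match the schema
    ratings = FOREACH raw GENERATE (int)uid AS uid, (int)mid AS mid, (int)rating AS rating, (long)timestamp AS timestamp;

    grouped_mid = GROUP ratings BY mid;

    dump grouped_mid;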