Order of Apache Pig Transformations


I am reading through Programming Pig by Alan Gates.

Consider the code:

ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS 
    (userID:int, movieID:int, rating:int, ratingTime:int);

metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS 
    (movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);

nameLookup = FOREACH metadata GENERATE 
    movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;

nameLookupYear = FOREACH nameLookup GENERATE 
    movieID, movieTitle, GetYear(releaseYear) AS finalYear;

filterMovies = FILTER nameLookupYear BY finalYear < 1982;

groupedMovies = GROUP filterMovies BY finalYear;

orderedMovies = FOREACH groupedMovies {
    sortOrder = ORDER metadata by finalYear DESC;
    GENERATE GROUP, finalYear;
    };

DUMP orderedMovies;

The book states that

"Sorting by maps, tuples or bags produces error".

I want to know how I can sort the grouped results.

Do the transformations need to follow a certain sequence for them to work?


Solution

  • Since you want to sort the grouped results themselves, you do not need a nested FOREACH. A nested FOREACH is for work inside each group, for example sorting the movies within each year by title or release date. Use a regular ORDER instead, and refer to finalYear as group, because GROUP names its key field group in the output relation:

    orderedMovies = ORDER groupedMovies BY group ASC;
    
    DUMP orderedMovies;
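
    If you do also want the movies inside each year sorted, the following is a minimal sketch of the nested FOREACH case mentioned above. It assumes filterMovies from the question, with the schema (movieID, movieTitle, finalYear); the aliases sortedWithinYear, byTitle and movies are made up for illustration. Note that inside the nested block the bag is named after the relation that was grouped (filterMovies), not metadata, and that the group key is referred to as group.

    ```pig
    -- Assumption: filterMovies is the relation built in the question,
    -- with fields (movieID:int, movieTitle:chararray, finalYear:int).
    groupedMovies = GROUP filterMovies BY finalYear;

    -- Nested FOREACH: sort the bag of movies inside each year by title.
    sortedWithinYear = FOREACH groupedMovies {
        -- Here filterMovies is the bag of tuples collected for this group.
        byTitle = ORDER filterMovies BY movieTitle ASC;
        GENERATE group AS finalYear, byTitle AS movies;
    };

    -- Regular ORDER: sort the groups themselves by the scalar year key.
    orderedMovies = ORDER sortedWithinYear BY finalYear DESC;

    DUMP orderedMovies;
    ```

    The outer ORDER works here because finalYear is a plain int; ordering by the movies bag itself would be the case the book warns about.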