Filtering Dates Using Apache Pig


I have a list of movies with their release dates. Using Apache Pig, I want to get the movies released after a given year; for example, for 1982 I want the movies from 1983, 1984 and so on.

The dates are in the format 01-Jan-1995. I can load the data correctly, but my FILTER operation reports a type mismatch.

I have tried converting the chararray to a datetime; however, the result is a full timestamp such as 1995-01-01T00:00:00.000-08:00.

1) How do I retrieve only the year?

2) How do I filter only the values that are newer than the selected year?

ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS (movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE movieID, movieTitle, ToString(releaseYear, 'yyyy') AS movieYear;
oldMovies = FILTER nameLookupYear by movieYear < ('1982');

DUMP oldMovies;

Solution

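  • Use GetYear() to extract the year from the datetime object. It returns an int, so compare it against the number 1982 rather than the string '1982'. Since you want movies newer than 1982, the filter should be movieYear > 1982, not movieYear < 1982: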

    nameLookupYear = FOREACH nameLookup GENERATE movieID, movieTitle, GetYear(releaseYear) AS movieYear;
    oldMovies = FILTER nameLookupYear by movieYear > 1982;
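
For reference, the whole flow could be put together roughly as follows. This is a minimal sketch, not a tested script: the path and schema are copied from the question, the aliases `dated`, `withYear` and `newMovies` are made up for illustration, and the null/empty check on `releaseDate` is an added precaution that assumes some rows may have no release date.

```pig
-- Path and schema are taken from the question; only the movie metadata is needed here.
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink:chararray);

-- Assumption: some rows may lack a release date, so drop them before parsing.
dated = FILTER metadata BY releaseDate IS NOT NULL AND releaseDate != '';

-- Parse the chararray (e.g. 01-Jan-1995) into a datetime, then extract the year as an int.
withYear = FOREACH dated GENERATE movieID, movieTitle,
    GetYear(ToDate(releaseDate, 'dd-MMM-yyyy')) AS movieYear;

-- Both sides of the comparison are ints, so no quotes around 1982.
newMovies = FILTER withYear BY movieYear > 1982;

DUMP newMovies;
```

Nesting GetYear() around ToDate() skips the intermediate datetime column. If you need the full date elsewhere, keep the two FOREACH steps separate as in the answer above.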