I have a list of movies with their release dates. Using Apache Pig, I want to get the movies that are newer than a given year, e.g. for 1982, the movies from 1983, 1984 and so on.
The dates are in the format 01-Jan-1995. I can load the data correctly, but my FILTER operation reports a type mismatch.
I have tried converting the chararray to a datetime; however, the result is the full date in the format 1995-01-01T00:00:00.000-08:00.
1) How do I retrieve only the year?
2) How do I filter only the values that are newer than the selected year?
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS (movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE movieID, movieTitle, ToString(releaseYear, 'yyyy') AS movieYear;
oldMovies = FILTER nameLookupYear by movieYear < ('1982');
DUMP oldMovies;
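Use GetYear() to extract the year part of the datetime object. It returns an int, whereas ToString() produces a chararray, so compare movieYear against the integer 1982 rather than a quoted string. Also, if you want movies newer than 1982, the filter should be movieYear > 1982, not movieYear < 1982: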
nameLookupYear = FOREACH nameLookup GENERATE movieID, movieTitle, GetYear(releaseYear) AS movieYear;
oldMovies = FILTER nameLookupYear by movieYear > 1982;
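For reference, here is a minimal sketch of how the whole pipeline might look with that change applied. The path and schema are copied from the question; the aliases releaseDateTime and newMovies are names I made up for illustration, and the ratings relation is left out because the filter does not use it.

```pig
-- Load the pipe-delimited movie metadata (path and schema as given in the question).
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink:chararray);

-- Parse strings such as 01-Jan-1995 into datetime values.
-- releaseDateTime is a hypothetical alias, not a name from the question.
nameLookup = FOREACH metadata GENERATE movieID, movieTitle,
    ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseDateTime;

-- GetYear() returns the year as an int, so it can be compared with a number.
nameLookupYear = FOREACH nameLookup GENERATE movieID, movieTitle,
    GetYear(releaseDateTime) AS movieYear;

-- Keep only movies released after 1982 (1983 onwards).
-- newMovies is a hypothetical alias chosen to match what the relation holds.
newMovies = FILTER nameLookupYear BY movieYear > 1982;

DUMP newMovies;
```

If you need to include 1982 itself, change the comparison to movieYear >= 1982.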