I have a CSV file as input that I am trying to process with Pig. The CSV has a date column that contains corrupt values in some rows. Please suggest a mechanism to filter out the rows with a corrupt date column before I apply the ToDate() function to that column in a FOREACH...GENERATE statement.
A sample format of my data is:
A,21,12/1/2010 8:26
B,33,12/1/2010 8:26
C,42,i am corrupted
D,30,12/1/2013 9:26
I want to load this and then transform it as follows, assuming the CSV file is loaded into Y(name,id,date):
X = FOREACH Y GENERATE ToDate(date, 'MM/dd/yyyy HH:mm') AS newdate;
I want to apply a FILTER to Y before the above statement to filter out the row starting with C. As is, the above statement throws an exception and the job fails when I DUMP X.
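ToDate fails in two cases:
1) When the date is missing or its syntax is wrong. Filter the rows with a regular expression first. Note that matches takes a Java regular expression without surrounding / delimiters and must match the entire field, so the pattern has to cover the time part and allow single-digit months, days and hours:
X = FILTER Y BY (date matches '(0?[1-9]|1[012])/(0?[1-9]|[12][0-9]|3[01])/(19|20)[0-9]{2} ([01]?[0-9]|2[0-3]):[0-5][0-9]');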
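2) When the local time falls into the hour that is skipped when DST (https://en.wikipedia.org/wiki/Daylight_saving_time) starts in your timezone. Such a time does not exist, and a pattern cannot detect it, so you have to filter those rows manually.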
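For case 1, the whole pipeline could look roughly like the sketch below. The file name 'input.csv' and the relation name Y_clean are placeholders, and the pattern assumes dates shaped like the sample rows (M/d/yyyy H:mm). It only checks the shape of the value, so an impossible calendar date such as 2/31/2010 would still pass the filter and fail in ToDate.

```pig
-- Load the CSV; 'input.csv' is a placeholder path.
Y = LOAD 'input.csv' USING PigStorage(',')
    AS (name:chararray, id:int, date:chararray);

-- Keep only rows whose date field has the expected shape.
-- matches must cover the entire field; rows with a null date
-- are dropped as well, because the condition is not true for them.
Y_clean = FILTER Y BY date matches
    '(0?[1-9]|1[012])/(0?[1-9]|[12][0-9]|3[01])/(19|20)[0-9]{2} ([01]?[0-9]|2[0-3]):[0-5][0-9]';

-- The conversion now only sees well-formed values.
-- MM is the month; lowercase mm means minutes.
X = FOREACH Y_clean GENERATE name, id,
    ToDate(date, 'MM/dd/yyyy HH:mm') AS newdate;

DUMP X;
```

With the sample data, the row starting with C should be removed by the FILTER before ToDate is ever applied.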