Avoid exception in ToDate in Pig for individual rows


I have a CSV file as input that I am trying to process with Pig. The CSV has a date column that contains corrupt values in some rows. Please suggest a way to filter out the rows with a corrupt date column before I apply the ToDate() function to that column in a FOREACH...GENERATE statement.

A sample format of my data is:

A,21,12/1/2010 8:26
B,33,12/1/2010 8:26
C,42,i am corrupted
D,30,12/1/2013 9:26

I want to load this and then transform it as follows, assuming the CSV file is loaded into Y(name,id,date):

X = FOREACH Y GENERATE ToDate(date, 'MM/dd/yyyy HH:mm') AS newdate;

I want to apply a FILTER to Y before the above statement to remove the row starting with C. As is, the statement throws an exception on that row and the job fails when I DUMP X.


Solution

  • ToDate fails in two cases:

    1) When the date is missing or its syntax is wrong. Filter the rows with a regular expression that matches only well-formed dates:

    X = FILTER Y BY (date matches '(0?[1-9]|1[0-2])/(0?[1-9]|[12][0-9]|3[01])/(19|20)[0-9]{2} ([01]?[0-9]|2[0-3]):[0-5][0-9]');
    

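    2) When the timestamp falls in the hour that is skipped at the start of DST (https://en.wikipedia.org/wiki/Daylight_saving_time) in your time zone. That local time does not exist, so parsing fails even though the syntax is valid, and a regular expression cannot catch it. You have to filter those rows manually.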
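
For reference, case 1 applied to the sample data from the question might look like the sketch below. The file name 'input.csv', the alias `valid`, and the column types are assumptions made for illustration; the pattern is written for the M/d/yyyy H:mm layout shown in the sample rows and would need adjusting for any other layout.

```pig
-- Assumption: 'input.csv' is a placeholder path; the schema mirrors Y(name,id,date).
Y = LOAD 'input.csv' USING PigStorage(',')
    AS (name:chararray, id:int, date:chararray);

-- Keep only rows whose date column has the shape M/d/yyyy H:mm.
-- `matches` takes a Java regular expression and has to match the whole value,
-- so no surrounding /.../ delimiters or anchors are used.
valid = FILTER Y BY date IS NOT NULL
    AND date matches '(0?[1-9]|1[0-2])/(0?[1-9]|[12][0-9]|3[01])/(19|20)[0-9]{2} ([01]?[0-9]|2[0-3]):[0-5][0-9]';

-- MM is the month pattern letter; lowercase mm means minutes.
X = FOREACH valid GENERATE name, id, ToDate(date, 'MM/dd/yyyy HH:mm') AS newdate;

DUMP X;
```

With the four sample rows, the filter should drop row C and pass A, B, and D on to ToDate. The pattern checks shape only, so a value such as 2/31/2010 8:26 would still get through and fail in ToDate. For case 2, ToDate also accepts a time zone as a third argument, for example ToDate(date, 'MM/dd/yyyy HH:mm', 'UTC'); whether a fixed zone is appropriate depends on how the timestamps were recorded.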