
Need to filter records by 1 minute in Pig script


The requirement is to filter records in Pig down to a one-minute window on a particular day. The sample data is as follows:

date_time                visits           count
2017-08-25 02:05:11        12345            5
2017-08-25 02:05:31        23456            7
2017-08-25 02:05:51        34567            1
2017-08-25 02:06:40        13423            3

In the above case, we just need the first 3 rows, so the filter window runs from 02:05:00 (inclusive) to 02:06:00 (exclusive).

Is there any way this can be achieved in Pig? I went through all the built-in functions, but all of them are specific to date. None work on the time part.

Please do let me know if you need more information on this.


Solution

  • GetMinute should help you filter the records. Create a new column, minute, from the first column and use it in the filter.

    Note that the same minute value also occurs in every other hour, so create an hour column as well and include it in the filter.

    If your date_time column is already of type datetime, apply GetHour() and GetMinute() to it directly, without the ToDate() call.

    B = FOREACH A GENERATE date_time,
            GetHour(ToDate(date_time, 'yyyy-MM-dd HH:mm:ss')) AS hour,
            GetMinute(ToDate(date_time, 'yyyy-MM-dd HH:mm:ss')) AS minute, visits, counts;
    C = FILTER B BY (hour == 2 AND minute == 5);
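
    The hour and minute filter matches 02:05 on every day in the input. If the window must also be tied to one date, a possible alternative is to compare the timestamp against both ends of the window. The sketch below assumes date_time is loaded as a chararray in the zero-padded 'yyyy-MM-dd HH:mm:ss' format, so plain string comparison orders the values chronologically. The file name, delimiter, and column types are hypothetical and not taken from the question.

    ```pig
    -- Hypothetical input path, delimiter, and schema; adjust to your data.
    A = LOAD 'visits.tsv' USING PigStorage('\t')
            AS (date_time:chararray, visits:long, counts:int);

    -- Half-open window [02:05:00, 02:06:00) on one specific day.
    -- String comparison is sufficient here only because the format
    -- is fixed-width and zero-padded (assumption about the input).
    C = FILTER A BY date_time >= '2017-08-25 02:05:00'
                AND date_time <  '2017-08-25 02:06:00';

    DUMP C;
    ```

    On the sample data this should keep the three 02:05 rows and drop the 02:06:40 row. If date_time is already a datetime, the GetHour()/GetMinute() approach above is probably the simpler choice.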