Search code examples
hadoopbigdataapache-pig

Can't Group DateTime by Hour or Dump Result Apache Pig


I'm working on a project that requires me to find the temporal average (e.g: hour, day, month) for multiple datasets and then do calculations on those averages. The issue I am running into is that Apache Pig will not group by the time nor dump the DateTime values. I've tried several solutions posted here on Stack Overlflow and elsewhere to no avail. I've also read over the documentation, and am unable to find a solution.

Here is my code so far:

data = LOAD 'TestData' USING PigStorage(',');

t_data = foreach data generate (chararray)$0 as date, (double)$305 as w_top, (double)$306 as t_top, (double)$310 as w_mid, (double)$311 as t_mid, (double)$315 as w_bot, (double)$316 as t_bot, (double)$319 as pressure;

times = FOREACH t_data GENERATE ToDate(date,'YYYY-MM-ddThh:mm:ss.s') as (date), w_top, t_top, w_mid, t_mid, w_bot, t_bot, pressure;



grp_hourly = GROUP times by GetHour(date);

average = foreach grp_hourly generate flatten(group), times.date, AVG(times.w_top), AVG(times.t_top), AVG(times.w_mid), AVG(times.t_mid), AVG(times.w_bot), AVG(times.t_bot);

And some sample lines from the data:

2011-01-06 15:00:00.0 ,0.07225,-11.36384,-0.045,-11.24599,0.036,-12.44104,1021.707
2011-01-06 15:00:00.1 ,0.09975,-11.34448,-0.0325,-11.26053,0.041,-12.45392,1021.694
2011-01-06 15:00:00.2 ,0.15375,-11.35576,-0.02975,-11.26536,0.01025,-12.44748,1021.407
2011-01-06 15:00:00.3 ,-0.00225,-11.42034,-0.03775,-11.28477,-0.013,-12.44429,1021.764
2011-01-06 15:00:00.4 ,0.01625,-11.33965,-0.0395,-11.27989,-0.0395,-12.42172,1021.484

What I Currently Get as Output:

I get a file with one average of every variable I feed into APACHE Pig without a date and time (most likely the average of each variable over the entire data set). I need them for each hour and to be printed with the output. Any tips would be appreciated. Sorry if my post is messy, I don't post to Stack Overflow often.


Solution

  • The date and time pattern string in ToDate doesn't exactly match your data. You have YYYY-MM-ddThh:mm:ss.s but your data looks like 2011-01-06 15:00:00.0. You need to match the spaces in your data, and since your hours are on the 24 hour, you need to use HH instead of hh. Check out the documentation for Java SimpleDateFormat class. Try this pattern string instead:

    times = FOREACH t_data GENERATE ToDate(date,'yyyy-MM-dd HH:mm:ss.s ') as date;
    

    To debug your code, try dumping right after creating the relation times instead of at the end since it seems like the problem is with ToDate().