I am new to Pig and the Hadoop world. The problem I have may be simple but I am not able to proceed.
So I have the below data which is basically swipe in data for a day. I need to compute the total number of hours an employee spends in a day i.e difference between the first in-time(time he reaches office) and the last out-Time(last swipe of the day) using PIG.
EmpID In_Time Out_Time
1 9:00 10:00
2 8:00 11:00
3 10:00 12:00
1 11:00 13:00
1 14:00 18:00
2 12:00 18:00
3 13:00 18:00
So I wrote the below script but it doesnt seem to give the right result.
grunt> emprec = load '/emptime/emptime' using PigStorage() as (empid:int,in:chararray,out:chararray);
grunt> aggdata = group emprec by empid;
grunt> emptime = foreach aggdata generate (emprec.empid,MIN(emprec.in),MAX(emprec.out));
I dont seem to get the right results with the script written.
The result i need is
Intermediate result (for my understanding)
EmpID In_Time Out_Time
1 9:00 18:00
2 8:00 18:00
3 10:00 18:00
Final Output needed is the difference of the Out_Time-In_Time
EmpID Total_Time
1 9:00
2 10:00
3 8:00
I have written the last line to get the Min and Max time so i can subtract the 2 and get the total time spent in office
Please note, in case you want to assume the time as Int or any other format, please do so as this is just an example.
Thanks in Advance
Regards, Chetan
Using MAX
, MIN
on chararray
datatype is not giving you expected results. if you want to avoid writing UDF
you can follow this solution:
I started with data:
1,9:00,10:00
2,8:00,11:00
3,10:00,12:00
1,11:00,13:00
1,14:00,18:00
2,12:00,18:00
3,13:00,18:00
emprec = load '/home/cloudera/pig.txt' using PigStorage(',') as (empid:int,in:chararray,out:chararray);
cleandata = foreach emprec generate $0 as ID ,(double)REPLACE($1,':','.') as in_time,(double)REPLACE($2,':','.') as out_time; --convert time to double so you can use eval functions on it
aggdata = group cleandata by ID;
emptime = foreach aggdata generate group ,MIN(cleandata.in_time),MAX(cleandata.out_time);
results = foreach emptime generate $0 as Emp_ID,REPLACE((chararray)($2 - $1),'\\.(?s)','\\:0') as time_diff; --convert back to time
dump results;
output:
(1,9:00)
(2,10:00)
(3,8:00)