Search code examples
hadoopapache-pigbigdata

Pig Script to compute Min, Max of emp timings


I am new to Pig and the Hadoop world. The problem I have may be simple but I am not able to proceed.

So I have the below data which is basically swipe in data for a day. I need to compute the total number of hours an employee spends in a day i.e difference between the first in-time(time he reaches office) and the last out-Time(last swipe of the day) using PIG.

EmpID In_Time Out_Time
1     9:00     10:00
2     8:00     11:00
3     10:00    12:00
1     11:00    13:00
1     14:00    18:00
2     12:00    18:00
3     13:00    18:00

So I wrote the below script but it doesnt seem to give the right result.

grunt> emprec = load '/emptime/emptime' using PigStorage() as (empid:int,in:chararray,out:chararray);
grunt> aggdata = group emprec by empid;
grunt> emptime = foreach aggdata generate   (emprec.empid,MIN(emprec.in),MAX(emprec.out));

I dont seem to get the right results with the script written.

The result i need is

Intermediate result (for my understanding)

EmpID  In_Time   Out_Time
1      9:00      18:00
2      8:00      18:00
3      10:00     18:00

Final Output needed is the difference of the Out_Time-In_Time

EmpID  Total_Time  
    1      9:00      
    2      10:00      
    3      8:00  

I have written the last line to get the Min and Max time so i can subtract the 2 and get the total time spent in office

Please note, in case you want to assume the time as Int or any other format, please do so as this is just an example.

Thanks in Advance

Regards, Chetan


Solution

  • Using MAX, MIN on chararray datatype is not giving you expected results. if you want to avoid writing UDFyou can follow this solution:

    I started with data:

    1,9:00,10:00
    2,8:00,11:00
    3,10:00,12:00
    1,11:00,13:00
    1,14:00,18:00
    2,12:00,18:00
    3,13:00,18:00
    

    emprec = load '/home/cloudera/pig.txt' using PigStorage(',') as (empid:int,in:chararray,out:chararray);
    cleandata = foreach emprec generate $0 as ID ,(double)REPLACE($1,':','.') as in_time,(double)REPLACE($2,':','.') as out_time; --convert time to double so you can use eval functions  on it
    aggdata = group cleandata by ID; 
    emptime = foreach aggdata generate group ,MIN(cleandata.in_time),MAX(cleandata.out_time);
    results = foreach emptime generate $0 as Emp_ID,REPLACE((chararray)($2 - $1),'\\.(?s)','\\:0') as time_diff; --convert back to time
    dump results;
    

    output:

    (1,9:00)
    (2,10:00)
    (3,8:00)