python pandas datetime time pytz

Pandas - Speeding up df.apply() - Calculating time difference


I have working code that calculates the business hours between two dates using df.apply(). However, my df has ~40k rows and this is extremely slow. Is there a way to speed it up by vectorising?

Original code:

import datetime
import pytz
import businesstimedelta
import holidays as pyholidays

workday = businesstimedelta.WorkDayRule(
    start_time=datetime.time(9),
    end_time=datetime.time(17),
    working_days=[0, 1, 2, 3, 4])


vic_holidays = pyholidays.AU(prov='VIC')
holidays = businesstimedelta.HolidayRule(vic_holidays)
businesshrs = businesstimedelta.Rules([workday, holidays])

def BusHrs(start, end):
    return businesshrs.difference(start,end).hours+float(businesshrs.difference(start,end).seconds)/float(3600)

df['Diff Hrs'] = df.apply(lambda row: BusHrs(row['Updated Date'], row['Current Date']), axis=1)

Gives:

Index   Created Date        Updated Date        Diff Hrs    Current Date
10086   2016-11-04 16:00:00 2016-11-11 11:38:00 35.633333   2018-05-29 10:09:11.291391
10087   2016-11-04 16:03:00 2016-11-29 12:54:00 132.850000  2018-05-29 10:09:11.291391
10088   2016-11-04 16:05:00 2016-11-16 08:05:00 56.916667   2018-05-29 10:09:11.291391
10089   2016-11-04 16:17:00 2016-11-08 11:37:00 11.333333   2018-05-29 10:09:11.291391
10090   2016-11-04 16:20:00 2016-11-16 09:58:00 57.633333   2018-05-29 10:09:11.291391
10091   2016-11-04 16:32:00 2016-11-08 11:10:00 10.633333   2018-05-29 10:09:11.291391

I can see it crunching, and it looks like it could take 5+ hours to complete.

Curiously, I have a hunch that it calculates faster the closer together the two dates are. For example,

df['Time Since Last Update'] = df.apply(lambda row: BusHrs(row['Updated Date'], row['Current Date']), axis=1)

is much faster than

df['Time Since Last Update'] = df.apply(lambda row: BusHrs(row['Created Date'], row['Updated Date']), axis=1)

Optimizing like this is a step above what I'm used to so any help is appreciated.


Solution

  • If you want to speed up your code, you can first redefine your function so that businesshrs.difference is called once per row instead of twice:

    def BusHrs(start, end):
        diff_hours = businesshrs.difference(start,end)
        return diff_hours.hours+float(diff_hours.seconds)/float(3600)
    

    Then, I think you can go faster by calculating the business hours only between two successive Updated Date values and summing these partial results up to the Current Date, so that each call spans a short interval. You need two temporary columns: one with the shifted Updated Date and one with the partial business hours.

    import pandas as pd

    # sort from the most recent date to the oldest
    df = df.sort_values('Updated Date', ascending=False)
    # create a column shifted by 1 row; the NaT in the first row becomes "now"
    # (pd.datetime was removed in pandas 2.0, so use pd.Timestamp.now())
    df['Shift Date'] = df['Updated Date'].shift(1).fillna(pd.Timestamp.now())
    # calculate the partial business hours between two successive update dates
    df['BsnHrs Partial'] = df.apply(lambda row: BusHrs(row['Updated Date'], row['Shift Date']), axis=1)
    # with this order, cumsum() adds up the partial business hours up to now
    df['Time Since Last Update'] = df['BsnHrs Partial'].cumsum()
    # drop the columns that are no longer useful and sort_index to restore the original order
    df = df.drop(columns=['Shift Date', 'BsnHrs Partial']).sort_index()
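    The shift-and-cumsum approach only gives the right answer if business hours are additive over adjacent intervals, i.e. hours(a, c) = hours(a, b) + hours(b, c). Here is a minimal sketch that checks this using only the standard library. Note that business_hours is a hypothetical, simplified stand-in for businesshrs.difference (Mon-Fri, 09:00-17:00, holidays ignored), now is a fixed timestamp chosen so the run is reproducible, and the sample dates are taken from the Updated Date column in the question.

```python
from datetime import datetime, time, timedelta
from itertools import accumulate

WORK_START = time(9)
WORK_END = time(17)


def business_hours(start, end):
    """Hours between start and end that fall on Mon-Fri, 09:00-17:00.

    Hypothetical stand-in for businesshrs.difference; holidays are ignored.
    """
    total = 0.0
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            window_open = max(start, datetime.combine(day, WORK_START))
            window_close = min(end, datetime.combine(day, WORK_END))
            if window_close > window_open:
                total += (window_close - window_open).total_seconds() / 3600
        day += timedelta(days=1)
    return total


# Fixed "current date" (an assumption, so the result does not depend on the clock)
now = datetime(2016, 11, 30, 12, 0)
updates = [
    datetime(2016, 11, 11, 11, 38),
    datetime(2016, 11, 29, 12, 54),
    datetime(2016, 11, 16, 8, 5),
    datetime(2016, 11, 8, 11, 37),
]

# Most recent first; pair each update with the next more recent one (or `now`),
# which mirrors sort_values(ascending=False) followed by shift(1).fillna(now)
ordered = sorted(updates, reverse=True)
next_points = [now] + ordered[:-1]

# Short-interval calls, then a running total (the cumsum() step)
partial = [business_hours(u, nxt) for u, nxt in zip(ordered, next_points)]
cumulative = list(accumulate(partial))

# Direct long-interval calls, for comparison
direct = [business_hours(u, now) for u in ordered]

for u, c, d in zip(ordered, cumulative, direct):
    print(u, round(c, 4), round(d, 4))
```

    Both columns of the printed output should agree up to floating-point rounding. The same check is worth running once against the real businesshrs rules on a small sample before trusting the cumulative column on all 40k rows.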