python pandas dataframe pandas-groupby timedelta

How to find total time between time intervals defined by start and end columns

I have a pandas DataFrame with the columns date, desk_id, confirm and cancel.

I want to calculate the difference between confirm and cancel in the following way:

For date 13.01.2020 and desk_id 1.0: (10:35:00 - 8:00:00) + (12:36:00 - 11:36:00) + (20:00:00 - 13:36:00)
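As a quick check of this arithmetic, here is a minimal sketch using pd.Timedelta. The variable names (day_start, day_end, before_first, between, after_last) are only illustrative, and the times are copied from the line above:

```python
import pandas as pd

# Window boundaries used in the calculation above.
day_start = pd.Timedelta("08:00:00")
day_end = pd.Timedelta("20:00:00")

# The three terms of the sum, one per gap in the day.
before_first = pd.Timedelta("10:35:00") - day_start             # before the first confirm
between = pd.Timedelta("12:36:00") - pd.Timedelta("11:36:00")   # between cancel and next confirm
after_last = day_end - pd.Timedelta("13:36:00")                 # after the last cancel

total = before_first + between + after_last
print(total)  # 0 days 09:59:00
```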

I was only able to do this for a desk with a single confirm/cancel pair, that is, when a desk_id has only one row of confirm and cancel times on a given date. In that case I get the difference I am interested in by subtracting 8:00:00 from confirm, subtracting cancel from 20:00:00, and adding the two results together.

I can't work out how to do it for multiple pairs, that is, when a desk_id has several rows of confirm and cancel times on one date. I would like to pick a date and desk_id and calculate the desk occupancy time from the differences between confirm and cancel for each desk.

The output should give, for each date and desk_id, the periods of time when the desk is free. My data can contain many confirms and cancels for a desk on one date.

Here is what I did for the case with a single confirm and cancel:

import datetime
import pandas as pd

df_1['confirm'] = pd.to_timedelta(df_1['confirm'].astype(str))
df_1['diff_confirm'] = df_1['confirm'].apply(lambda x: x - datetime.timedelta(days=0, hours=8, minutes=0))
df_1['cancel'] = pd.to_timedelta(df_1['cancel'].astype(str))
df_1['diff_cancel'] = df_1['cancel'].apply(lambda x: datetime.timedelta(days=0, hours=20, minutes=0)-x)

and this works.

Any tips?

Solution

You did not make it entirely clear what format you need your results in, but I assume it is okay to put them in a separate dataframe. This solution operates on each group of rows defined by the values of date and desk_id, computes the total time for each group, and places the output in a new dataframe.

Code to create your input dataframe:

from datetime import timedelta
import pandas as pd

df = pd.DataFrame(
    {
        'date': [pd.Timestamp('2020-1-13'), pd.Timestamp('2020-1-13'),
                 pd.Timestamp('2020-1-13'), pd.Timestamp('2020-1-14'),
                 pd.Timestamp('2020-1-14'), pd.Timestamp('2020-1-14')],
        'desk_id': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
        'confirm': ['10:36:00', '12:36:00', '09:36:00', '10:36:00', '12:36:00',
                    '15:36:00'],
        'cancel': ['11:36:00', '13:36:00', '11:36:00', '11:36:00', '14:36:00',
                   '16:36:00']
    }
)

Solution:

df['confirm'] = pd.to_timedelta(df['confirm'])
df['cancel'] = pd.to_timedelta(df['cancel'])

# function to compute total time each desk is free
def total_time(df):
    return (
        (df.iloc[0]['confirm'] - timedelta(days=0, hours=8, minutes=0)) +
        (df['confirm'] - df['cancel'].shift()).sum() +
        (timedelta(days=0, hours=20, minutes=0) - df.iloc[-1]['cancel'])
    )

# apply function to each combination of 'desk_id' and 'date', producing
# a new dataframe
df.groupby(['desk_id', 'date']).apply(total_time).reset_index(name='total_time')


#    desk_id          date       total_time
# 0      1.0    2020-01-13  0 days 10:00:00
# 1      1.0    2020-01-14  0 days 11:00:00
# 2      2.0    2020-01-13  0 days 10:00:00
# 3      2.0    2020-01-14  0 days 09:00:00

The function takes the difference between the first value of confirm and 8:00:00, then the differences between each confirm and the preceding cancel value, and finally the difference between 20:00:00 and the last value of cancel. Those differences are added together to produce the final value.
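To see the three terms separately, here is a minimal sketch that walks through a single group (date 2020-01-13, desk_id 1.0) by hand. The names group, gaps, before_first, between, after_last, free and booked are illustrative and not part of the solution above. The 8:00 and 20:00 boundaries are taken from the question. The final assertion is an extra sanity check I added: the free time should equal the 12-hour window minus the booked time.

```python
import pandas as pd

# One group only: date 2020-01-13, desk_id 1.0, values from the input dataframe.
group = pd.DataFrame({
    "confirm": pd.to_timedelta(["10:36:00", "12:36:00"]),
    "cancel": pd.to_timedelta(["11:36:00", "13:36:00"]),
})

day_start = pd.Timedelta(hours=8)   # start of the window, as in the question
day_end = pd.Timedelta(hours=20)    # end of the window, as in the question

# shift() moves each cancel down one row, so every row sees the previous cancel.
# The first row has no predecessor and becomes NaT, which sum() skips by default.
gaps = group["confirm"] - group["cancel"].shift()

before_first = group["confirm"].iloc[0] - day_start   # 02:36:00
between = gaps.sum()                                  # 01:00:00
after_last = day_end - group["cancel"].iloc[-1]       # 06:24:00
free = before_first + between + after_last

# Sanity check: free time equals the whole window minus the booked time.
booked = (group["cancel"] - group["confirm"]).sum()
assert free == (day_end - day_start) - booked

print(free)  # 0 days 10:00:00
```

Both the function above and this sketch rely on the rows of each group being in chronological order. If that is not guaranteed, sorting first, for example with df.sort_values(['date', 'desk_id', 'confirm']), is a reasonable precaution.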