How to find total time between time intervals defined by start and end columns


I have a pandas DataFrame with the columns date, desk_id, confirm and cancel.

I want to calculate the difference between confirm and cancel in the following way:

For date 13.01.2020 and desk_id 1.0 : 10:35:00 – 8:00:00 + 12:36:00 – 11:36:00 + 20:00:00 - 13:36:00

I was only able to do this for a desk with a single confirm and cancel, that is, when a desk_id has only one row with a confirm and cancel time on a given date. In that case I get the difference I am interested in by subtracting 8:00:00 from confirm and cancel from 20:00:00, then adding the two together.

With several bookings I can't put it together. By several I mean that a desk_id has more than one row with confirm and cancel times on a single date. I would like to pick a date and desk_id and calculate the desk occupancy time, i.e. the difference between confirm and cancel for each desk.

The output should look like this: [image]

I would like to find the periods of time when a desk is free. My data can contain many confirms and cancels for a desk on one date.

Here is what I did for a single confirm and cancel:

import datetime
import pandas as pd
# free time before the confirm, measured from 8:00:00
df_1['confirm'] = pd.to_timedelta(df_1['confirm'].astype(str))
df_1['diff_confirm'] = df_1['confirm'] - datetime.timedelta(hours=8)
# free time after the cancel, measured up to 20:00:00
df_1['cancel'] = pd.to_timedelta(df_1['cancel'].astype(str))
df_1['diff_cancel'] = datetime.timedelta(hours=20) - df_1['cancel']

and this works.

Any tips?


Solution

  • You did not make it entirely clear what format you need your results in, but I assume it is okay to put them in a separate dataframe. So this solution operates on each group of rows defined by values of date and desk_id and computes the total time for each group, with output placed in a new dataframe:

    Code to create your input dataframe:

    from datetime import timedelta
    import pandas as pd
    
    df = pd.DataFrame(
        {
            'date': [pd.Timestamp('2020-1-13'), pd.Timestamp('2020-1-13'),
                     pd.Timestamp('2020-1-13'), pd.Timestamp('2020-1-14'),
                     pd.Timestamp('2020-1-14'), pd.Timestamp('2020-1-14')],
            'desk_id': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
            'confirm': ['10:36:00', '12:36:00', '09:36:00', '10:36:00', '12:36:00',
                        '15:36:00'],
            'cancel': ['11:36:00', '13:36:00', '11:36:00', '11:36:00', '14:36:00',
                       '16:36:00']
        }
    )
    

    Solution:

    df['confirm'] = pd.to_timedelta(df['confirm'])
    df['cancel'] = pd.to_timedelta(df['cancel'])
    
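    # function to compute the total time a desk is free on one date
    def total_time(group):
        # the gaps are only meaningful when the rows are in chronological order
        group = group.sort_values('confirm')
        return (
            # free time from 8:00:00 until the first confirm
            (group['confirm'].iloc[0] - timedelta(hours=8)) +
            # free time between each cancel and the next confirm
            (group['confirm'] - group['cancel'].shift()).sum() +
            # free time from the last cancel until 20:00:00
            (timedelta(hours=20) - group['cancel'].iloc[-1])
        )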
    
    # apply function to each combination of 'desk_id' and 'date', producing
    # a new dataframe
    df.groupby(['desk_id', 'date']).apply(total_time).reset_index(name='total_time')
    
    
    #    desk_id          date       total_time
    # 0      1.0    2020-01-13  0 days 10:00:00
    # 1      1.0    2020-01-14  0 days 11:00:00
    # 2      2.0    2020-01-13  0 days 10:00:00
    # 3      2.0    2020-01-14  0 days 09:00:00
    
    

    The function takes the difference between the first value of confirm and 8:00:00, then the differences between each confirm and the preceding cancel, and finally the difference between 20:00:00 and the last value of cancel. Those differences are added together to produce the final value. The result is only meaningful when the intervals within a group do not overlap.
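
As a cross-check, the same totals can be reached from the other direction: when the bookings for a desk do not overlap and all fall between 8:00:00 and 20:00:00, the free time is the 12-hour window minus the total booked time. The sketch below is not part of the original answer; it rests on those two assumptions, and the names DAY_START, DAY_END, booked and free are made up for illustration.

```python
import pandas as pd

# Same sample data as in the answer, with the time columns already
# converted to timedeltas.
df = pd.DataFrame(
    {
        'date': pd.to_datetime(['2020-01-13', '2020-01-13', '2020-01-13',
                                '2020-01-14', '2020-01-14', '2020-01-14']),
        'desk_id': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
        'confirm': pd.to_timedelta(['10:36:00', '12:36:00', '09:36:00',
                                    '10:36:00', '12:36:00', '15:36:00']),
        'cancel': pd.to_timedelta(['11:36:00', '13:36:00', '11:36:00',
                                   '11:36:00', '14:36:00', '16:36:00']),
    }
)

# Assumption: the window runs from 8:00:00 to 20:00:00, as in the question.
DAY_START = pd.Timedelta(hours=8)
DAY_END = pd.Timedelta(hours=20)

# Total booked time per desk and date (sum of cancel - confirm).
booked = (
    df.assign(booked=df['cancel'] - df['confirm'])
      .groupby(['desk_id', 'date'])['booked']
      .sum()
)

# Free time is the length of the window minus the booked time.
# Assumption: bookings do not overlap and stay inside the window.
free = ((DAY_END - DAY_START) - booked).reset_index(name='total_time')
print(free)
```

On the sample data this gives the same four totals as the table above (10, 11, 10 and 9 hours), and it does not depend on the row order within a group.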