I have a pandas DataFrame:
I want to calculate the difference between confirm and cancel in the following way:
For date 13.01.2020 and desk_id 1.0: (10:36:00 - 8:00:00) + (12:36:00 - 11:36:00) + (20:00:00 - 13:36:00)
I was only able to do this for a desk with a single booking, that is, when a desk_id has only one row with a confirm and cancel time on a given date. In that case I subtract 8:00:00 from confirm, subtract cancel from 20:00:00, and add the two results together.
For several bookings, I can't put it together. By several bookings I mean that a desk_id has a few rows with confirm and cancel times on one date. I would like to choose the date and desk_id and calculate the desk occupancy time, i.e. the difference between confirm and cancel for each desk.
Output should look like:
I would like to find the periods of time when a desk is free. My data can contain many confirms and cancels for a desk on one date.
I did it for a single confirm and cancel:
import datetime
import pandas as pd
# time between 8:00:00 and confirm
df_1['confirm'] = pd.to_timedelta(df_1['confirm'].astype(str))
df_1['diff_confirm'] = df_1['confirm'].apply(lambda x: x - datetime.timedelta(days=0, hours=8, minutes=0))
# time between cancel and 20:00:00
df_1['cancel'] = pd.to_timedelta(df_1['cancel'].astype(str))
df_1['diff_cancel'] = df_1['cancel'].apply(lambda x: datetime.timedelta(days=0, hours=20, minutes=0)-x)
and this works.
Any tips?
You did not make it entirely clear what format you need your results in, but I assume it is okay to put them in a separate dataframe. This solution operates on each group of rows defined by the values of date and desk_id and computes the total time for each group, with the output placed in a new dataframe:
Code to create your input dataframe:
from datetime import timedelta
import pandas as pd
df = pd.DataFrame(
    {
        'date': [pd.Timestamp('2020-1-13'), pd.Timestamp('2020-1-13'),
                 pd.Timestamp('2020-1-13'), pd.Timestamp('2020-1-14'),
                 pd.Timestamp('2020-1-14'), pd.Timestamp('2020-1-14')],
        'desk_id': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
        'confirm': ['10:36:00', '12:36:00', '09:36:00', '10:36:00', '12:36:00',
                    '15:36:00'],
        'cancel': ['11:36:00', '13:36:00', '11:36:00', '11:36:00', '14:36:00',
                   '16:36:00']
    }
)
Solution:
df['confirm'] = pd.to_timedelta(df['confirm'])
df['cancel'] = pd.to_timedelta(df['cancel'])
# function to compute total time each desk is free
def total_time(df):
    return (
        (df.iloc[0]['confirm'] - timedelta(days=0, hours=8, minutes=0)) +
        (df['confirm'] - df['cancel'].shift()).sum() +
        (timedelta(days=0, hours=20, minutes=0) - df.iloc[-1]['cancel'])
    )
# apply function to each combination of 'desk_id' and 'date', producing
# a new dataframe
df.groupby(['desk_id', 'date']).apply(total_time).reset_index(name='total_time')
#    desk_id       date      total_time
# 0      1.0 2020-01-13 0 days 10:00:00
# 1      1.0 2020-01-14 0 days 11:00:00
# 2      2.0 2020-01-13 0 days 10:00:00
# 3      2.0 2020-01-14 0 days 09:00:00
The function takes the difference between the first value of confirm and 8:00:00, then the differences between each confirm and the preceding cancel, and finally the difference between 20:00:00 and the last value of cancel. Those differences are added together to produce the final value. Note that this relies on the rows of each group being in chronological order.
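If you also want the individual free periods rather than only their sum, the same idea can be sketched as below. This is only an illustration under some assumptions: the names OPEN, CLOSE, free and totals are made up for this example, the day is taken to run from 8:00:00 to 20:00:00 as in the question, and the bookings of a desk on one date are assumed not to overlap.

```python
import pandas as pd

# Assumed opening hours, taken from the question (8:00:00 to 20:00:00)
OPEN = pd.Timedelta(hours=8)
CLOSE = pd.Timedelta(hours=20)

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-13'] * 3 + ['2020-01-14'] * 3),
    'desk_id': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
    'confirm': pd.to_timedelta(['10:36:00', '12:36:00', '09:36:00',
                                '10:36:00', '12:36:00', '15:36:00']),
    'cancel': pd.to_timedelta(['11:36:00', '13:36:00', '11:36:00',
                               '11:36:00', '14:36:00', '16:36:00']),
})

rows = []
# Sort by confirm first so each group is in chronological order
for (desk_id, date), g in df.sort_values('confirm').groupby(['desk_id', 'date']):
    # A free period starts at opening time or at a cancel ...
    starts = [OPEN, *g['cancel']]
    # ... and ends at the next confirm or at closing time
    ends = [*g['confirm'], CLOSE]
    for start, end in zip(starts, ends):
        rows.append({'desk_id': desk_id, 'date': date,
                     'free_from': start, 'free_to': end,
                     'duration': end - start})

# One row per free period
free = pd.DataFrame(rows)

# Summing the durations gives the same totals as total_time
totals = free.groupby(['desk_id', 'date'])['duration'].sum()
print(free)
print(totals)
```

Sorting inside the loop's input means the result does not depend on the row order of the original dataframe.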