I'm dealing with a table with this format:
Worker_ID | start_time | end_time |
---|---|---|
Alice | 2023-02-14 09:12:00 | 2023-02-14 09:26:00 |
Alice | 2023-02-14 09:27:00 | 2023-02-14 10:22:00 |
Alice | 2023-02-14 11:04:00 | 2023-02-14 11:28:00 |
Alice | 2023-02-14 11:33:00 | 2023-02-14 11:54:00 |
Bob | 2023-02-15 08:11:00 | 2023-02-15 08:42:00 |
Bob | 2023-02-15 09:02:00 | 2023-02-15 09:31:00 |
Bob | 2023-02-15 09:32:00 | 2023-02-15 09:35:00 |
Bob | 2023-02-15 09:38:00 | 2023-02-15 09:58:00 |
Bob | 2023-02-15 10:32:00 | 2023-02-15 10:49:00 |
That is, for each worker, the table records the start_time and end_time of each shift on a machine. I need to compute the time since the previous break, where a gap only counts as a break if the worker spends more than 15 minutes offline. Otherwise, it is not considered a break. Additionally, if there is no earlier shift that day, the time is counted from the first start_time of the day. So, for each row, I need a column giving the minutes since the previous break under these conditions:
Worker_ID | start_time | end_time | minutes_since_previous_break |
---|---|---|---|
Alice | 2023-02-14 09:12:00 | 2023-02-14 09:26:00 | 0 |
Alice | 2023-02-14 09:27:00 | 2023-02-14 10:22:00 | 15 |
Alice | 2023-02-14 11:04:00 | 2023-02-14 11:28:00 | 0 |
Alice | 2023-02-14 11:33:00 | 2023-02-14 11:54:00 | 29 |
Bob | 2023-02-15 08:11:00 | 2023-02-15 08:42:00 | 0 |
Bob | 2023-02-15 09:02:00 | 2023-02-15 09:31:00 | 0 |
Bob | 2023-02-15 09:32:00 | 2023-02-15 09:35:00 | 30 |
Bob | 2023-02-15 09:38:00 | 2023-02-15 09:58:00 | 36 |
Bob | 2023-02-15 10:32:00 | 2023-02-15 10:49:00 | 0 |
In the case of Alice, the first row is the first of the day, so we assign 0 minutes since the last break. The second shift starts just one minute after the previous one finished, so the gap doesn't count as a break. Therefore, we accumulate the time since the previous break, giving 15 minutes (09:12 to 09:27). The third shift starts about 40 minutes after the previous shift ended, so the count restarts and we assign 0 minutes. The last Alice row follows only a 5-minute gap, which doesn't count as a break, so the time keeps accumulating (29 minutes, from 11:04 to 11:33).
For Bob the logic is the same. So far I have only managed this:
df = df.sort_values(by=['Worker_ID', 'start_time'])
df['minutes_between_shifts'] = (df['start_time'] - df['end_time'].shift()).dt.total_seconds()/60
df = df[df['minutes_between_shifts']>=15]
df['minutes_since_last_break'] = 0
That is, I compute the minutes between consecutive shifts, and if the gap is larger than 15 minutes there has been a valid break, so I assign the value 0. But I can't work out how to apply the two remaining conditions. I tried the opposite approach, making a copy of df with the rows where minutes_between_shifts is under 15 minutes, but that doesn't solve it. Any ideas are very welcome.
Example Code
import pandas as pd
data1 = {'Worker_ID': ['Alice', 'Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob', 'Bob', 'Bob'],
'start_time': ['2023-02-14 09:12:00', '2023-02-14 09:27:00', '2023-02-14 11:04:00',
'2023-02-14 11:33:00', '2023-02-15 08:11:00', '2023-02-15 09:02:00',
'2023-02-15 09:32:00', '2023-02-15 09:38:00', '2023-02-15 10:32:00'],
'end_time': ['2023-02-14 09:26:00', '2023-02-14 10:22:00', '2023-02-14 11:28:00',
'2023-02-14 11:54:00', '2023-02-15 08:42:00', '2023-02-15 09:31:00',
'2023-02-15 09:35:00', '2023-02-15 09:58:00', '2023-02-15 10:49:00']}
df = pd.DataFrame(data1)
Step1
First, convert both time columns to datetime dtype:
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
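A quick way to see why the conversion matters: once both columns are datetimes, subtracting them yields timedeltas that can be turned into minutes or compared with pd.Timedelta. This is only a small sketch on two of the sample rows; the names `duration` and `minutes` are made up for illustration.

```python
import pandas as pd

# Two rows from the sample data, still stored as plain strings.
df = pd.DataFrame({
    "start_time": ["2023-02-14 09:12:00", "2023-02-14 09:27:00"],
    "end_time": ["2023-02-14 09:26:00", "2023-02-14 10:22:00"],
})

# Convert both columns so that arithmetic on them is possible.
for col in ["start_time", "end_time"]:
    df[col] = pd.to_datetime(df[col])

# Datetime columns subtract to timedeltas; convert those to minutes.
duration = df["end_time"] - df["start_time"]
minutes = duration.dt.total_seconds() / 60
print(minutes.tolist())  # [14.0, 55.0]
```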
Step2
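To do this, create a Series s that is 0 when a shift follows a valid break (a gap of more than 15 minutes) or is the worker's first shift, and otherwise holds the minutes between the previous start_time and the current start_time. Each 0 in s marks the start of a new block, so a running count of those zeros labels the blocks. Then group by Worker_ID and that block label, and use cumsum() to accumulate the minutes within each block.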
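g = df.groupby('Worker_ID')
# True when the gap since the worker's previous shift is too short to count as a break
cond1 = g['end_time'].shift().rsub(df['start_time']) <= pd.Timedelta('15min')
# minutes since the previous start_time where there was no break, else 0 (break or first shift)
s = g['start_time'].diff().div(pd.Timedelta('1min')).where(cond1, 0).astype('int')
# each 0 in s opens a new block per worker; accumulate the minutes within each block
df['minutes_since_previous_break'] = s.groupby([df['Worker_ID'], s.eq(0).groupby(df['Worker_ID']).cumsum()]).cumsum()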
Rows from different workers may be interleaved, so every grouping step has to include Worker_ID as well.
df
Worker_ID start_time end_time minutes_since_previous_break
0 Alice 2023-02-14 09:12:00 2023-02-14 09:26:00 0
1 Alice 2023-02-14 09:27:00 2023-02-14 10:22:00 15
2 Alice 2023-02-14 11:04:00 2023-02-14 11:28:00 0
3 Alice 2023-02-14 11:33:00 2023-02-14 11:54:00 29
4 Bob 2023-02-15 08:11:00 2023-02-15 08:42:00 0
5 Bob 2023-02-15 09:02:00 2023-02-15 09:31:00 0
6 Bob 2023-02-15 09:32:00 2023-02-15 09:35:00 30
7 Bob 2023-02-15 09:38:00 2023-02-15 09:58:00 36
8 Bob 2023-02-15 10:32:00 2023-02-15 10:49:00 0
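As a cross-check of the rule, the same numbers can be reproduced with a plain loop and no pandas. This is only a sketch: the function name `minutes_since_previous_break`, the helper `ts`, and the tuple layout are made up for illustration. Keying the running state by worker and calendar date is my reading of "first of the day" in the question; the pandas code above groups by Worker_ID only.

```python
from datetime import datetime, timedelta


def ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string (hypothetical helper)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def minutes_since_previous_break(shifts, min_break=timedelta(minutes=15)):
    """shifts: (worker, start, end) tuples, in chronological order per worker."""
    state = {}  # (worker, date) -> (previous start, previous end, running minutes)
    out = []
    for worker, start, end in shifts:
        key = (worker, start.date())  # assumption: the count restarts every day
        prev = state.get(key)
        if prev is not None and start - prev[1] <= min_break:
            # Gap too short to be a break: add the time since the previous start.
            running = prev[2] + int((start - prev[0]).total_seconds() // 60)
        else:
            # First shift of the day, or a gap of more than 15 minutes.
            running = 0
        state[key] = (start, end, running)
        out.append(running)
    return out


shifts = [
    ("Alice", ts("2023-02-14 09:12:00"), ts("2023-02-14 09:26:00")),
    ("Alice", ts("2023-02-14 09:27:00"), ts("2023-02-14 10:22:00")),
    ("Alice", ts("2023-02-14 11:04:00"), ts("2023-02-14 11:28:00")),
    ("Alice", ts("2023-02-14 11:33:00"), ts("2023-02-14 11:54:00")),
    ("Bob", ts("2023-02-15 08:11:00"), ts("2023-02-15 08:42:00")),
    ("Bob", ts("2023-02-15 09:02:00"), ts("2023-02-15 09:31:00")),
    ("Bob", ts("2023-02-15 09:32:00"), ts("2023-02-15 09:35:00")),
    ("Bob", ts("2023-02-15 09:38:00"), ts("2023-02-15 09:58:00")),
    ("Bob", ts("2023-02-15 10:32:00"), ts("2023-02-15 10:49:00")),
]

result = minutes_since_previous_break(shifts)
print(result)  # [0, 15, 0, 29, 0, 0, 30, 36, 0]
```

The loop is slower than the vectorized version on large tables, but it makes each condition explicit, which helps when checking the pandas result on a few rows.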
Intermediate values
cond1 s s.eq(0).groupby(df['Worker_ID']).cumsum()
0 False 0 1
1 True 15 1
2 False 0 2
3 True 29 2
4 False 0 1
5 False 0 2
6 True 30 2
7 True 6 2
8 False 0 3
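The intermediate columns above can be rebuilt one step at a time, which may make the one-liner easier to follow. This is a sketch under a few assumptions: the names `gap`, `step`, `block` and `inter` are my own labels, and `s` is built with a single `where` call, which should give the same values as shown in the table.

```python
import pandas as pd

df = pd.DataFrame({
    "Worker_ID": ["Alice"] * 4 + ["Bob"] * 5,
    "start_time": pd.to_datetime([
        "2023-02-14 09:12:00", "2023-02-14 09:27:00", "2023-02-14 11:04:00",
        "2023-02-14 11:33:00", "2023-02-15 08:11:00", "2023-02-15 09:02:00",
        "2023-02-15 09:32:00", "2023-02-15 09:38:00", "2023-02-15 10:32:00"]),
    "end_time": pd.to_datetime([
        "2023-02-14 09:26:00", "2023-02-14 10:22:00", "2023-02-14 11:28:00",
        "2023-02-14 11:54:00", "2023-02-15 08:42:00", "2023-02-15 09:31:00",
        "2023-02-15 09:35:00", "2023-02-15 09:58:00", "2023-02-15 10:49:00"]),
})

g = df.groupby("Worker_ID")

# Offline time before each shift; NaT for a worker's first shift.
gap = df["start_time"] - g["end_time"].shift()

# True when the gap is too short to count as a break (NaT compares as False).
cond1 = gap <= pd.Timedelta("15min")

# Minutes between consecutive start times within each worker.
step = g["start_time"].diff().div(pd.Timedelta("1min"))

# Keep the step only where there was no break; otherwise reset to 0.
s = step.where(cond1, 0).astype(int)

# Running count of resets per worker: every 0 in s opens a new block.
block = s.eq(0).groupby(df["Worker_ID"]).cumsum()

inter = pd.DataFrame({"cond1": cond1, "s": s, "block": block})
print(inter)

# Accumulate the minutes inside each (worker, block) pair.
result = s.groupby([df["Worker_ID"], block]).cumsum()
print(result.tolist())  # [0, 15, 0, 29, 0, 0, 30, 36, 0]
```

Row 7 shows why the final cumsum is needed: s holds only the 6 minutes since the previous start, and adding the 30 minutes already accumulated in the same block gives 36.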