Let's say we have the following pandas DataFrame in Python:
worker | shift_start | shift_end | function |
---|---|---|---|
Alice | 2022-01-15 11:30:00 | 2022-01-15 15:30:00 | A |
Alice | 2022-01-15 17:30:00 | 2022-01-15 18:29:59 | A |
Alice | 2022-01-15 18:30:00 | 2022-01-15 20:30:00 | B |
Alice | 2022-01-16 10:30:00 | 2022-01-16 11:29:59 | B |
Alice | 2022-01-16 12:30:00 | 2022-01-16 15:30:00 | A |
Bob | 2022-01-15 10:30:00 | 2022-01-15 12:29:59 | B |
Bob | 2022-01-15 12:30:00 | 2022-01-15 14:30:00 | A |
Bob | 2022-01-15 15:30:00 | 2022-01-15 18:30:00 | A |
Bob | 2022-01-17 10:30:00 | 2022-01-17 15:30:00 | A |
Bob | 2022-01-17 16:30:00 | 2022-01-17 18:30:00 | B |
Here shift_start and shift_end are datetime columns. Each worker has their own schedule, with a function that may vary (irrelevant here). I would like to compute the break time between days, that is, the time difference between the end of the last shift on one day and the start of the first shift on the next day they work:
worker | shift_start | shift_end | function | break_time |
---|---|---|---|---|
Alice | 2022-01-15 11:30:00 | 2022-01-15 15:30:00 | A | NaN |
Alice | 2022-01-15 17:30:00 | 2022-01-15 18:29:59 | A | NaN |
Alice | 2022-01-15 18:30:00 | 2022-01-15 20:30:00 | B | NaN |
Alice | 2022-01-16 10:30:00 | 2022-01-16 11:29:59 | B | 0 days 14 hours |
Alice | 2022-01-16 12:30:00 | 2022-01-16 15:30:00 | A | 0 days 14 hours |
Bob | 2022-01-15 10:30:00 | 2022-01-15 12:29:59 | B | NaN |
Bob | 2022-01-15 12:30:00 | 2022-01-15 14:30:00 | A | NaN |
Bob | 2022-01-15 15:30:00 | 2022-01-15 18:30:00 | A | NaN |
Bob | 2022-01-17 10:30:00 | 2022-01-17 15:30:00 | A | 1 days 16 hours |
Bob | 2022-01-17 16:30:00 | 2022-01-17 18:30:00 | B | 1 days 16 hours |
In both cases, January 15th has "NaN" because we do not know the previous shift. For Alice, 14 hours pass between the end of her last shift at 20:30 and the start of the next one the day after at 10:30. For Bob, however, there is a day off in between, so the difference is 1 day and 16 hours. This break_time column should be repeated for all the entries in a given day. I am unable to find a way to group by worker and then compute the last and first shift for each day. Any help would be much appreciated.
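For reproducibility, here is a minimal sketch that builds the sample shown above. The values are copied from the table; parsing the strings with pd.to_datetime is an assumption about how the datetime columns were created.

```python
import pandas as pd

# Sample rows copied from the table in the question
rows = [
    ("Alice", "2022-01-15 11:30:00", "2022-01-15 15:30:00", "A"),
    ("Alice", "2022-01-15 17:30:00", "2022-01-15 18:29:59", "A"),
    ("Alice", "2022-01-15 18:30:00", "2022-01-15 20:30:00", "B"),
    ("Alice", "2022-01-16 10:30:00", "2022-01-16 11:29:59", "B"),
    ("Alice", "2022-01-16 12:30:00", "2022-01-16 15:30:00", "A"),
    ("Bob", "2022-01-15 10:30:00", "2022-01-15 12:29:59", "B"),
    ("Bob", "2022-01-15 12:30:00", "2022-01-15 14:30:00", "A"),
    ("Bob", "2022-01-15 15:30:00", "2022-01-15 18:30:00", "A"),
    ("Bob", "2022-01-17 10:30:00", "2022-01-17 15:30:00", "A"),
    ("Bob", "2022-01-17 16:30:00", "2022-01-17 18:30:00", "B"),
]
df = pd.DataFrame(rows, columns=["worker", "shift_start", "shift_end", "function"])

# Assumption: the two time columns are converted from strings to datetime64
for col in ["shift_start", "shift_end"]:
    df[col] = pd.to_datetime(df[col])

print(df.dtypes)
```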
I'm not completely sure about the structure of your data, but you could try the following (with df being your dataframe):
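# End of the previous shift for each row, shifted within each worker
shifted_end = df.groupby("worker")["shift_end"].shift()
# Mask of day breaks: the shift starts on a different date than the previous one ended
m = df["shift_start"].dt.date != shifted_end.dt.date
df["break_time"] = (
    (df["shift_start"] - shifted_end)[m].reindex_like(df)
    .groupby(df["worker"]).ffill()
)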
- Build shifted_end with shifted values from the column shift_end. This has to be done grouped by worker to avoid mixing data from different workers.
- Build the mask m to identify the day breaks by comparing the .dt.date versions of shift_start and shifted_end.
- Subtract the shifted shift_end column from the shift_start column and extract only the results at the day breaks m. Use .reindex_like(df) to get the full index back, with NaNs at the recovered indices. Then fill the NaNs forward to propagate the day break values. This, again, has to be done grouped by worker; otherwise the start of each worker group would get the last value from the worker before.

Result for your sample is:
worker shift_start shift_end function break_time
0 Alice 2022-01-15 11:30:00 2022-01-15 15:30:00 A NaT
1 Alice 2022-01-15 17:30:00 2022-01-15 18:29:59 A NaT
2 Alice 2022-01-15 18:30:00 2022-01-15 20:30:00 B NaT
3 Alice 2022-01-16 10:30:00 2022-01-16 11:29:59 B 0 days 14:00:00
4 Alice 2022-01-16 12:30:00 2022-01-16 15:30:00 A 0 days 14:00:00
5 Bob 2022-01-15 10:30:00 2022-01-15 12:29:59 B NaT
6 Bob 2022-01-15 12:30:00 2022-01-15 14:30:00 A NaT
7 Bob 2022-01-15 15:30:00 2022-01-15 18:30:00 A NaT
8 Bob 2022-01-17 10:30:00 2022-01-17 15:30:00 A 1 days 16:00:00
9 Bob 2022-01-17 16:30:00 2022-01-17 18:30:00 B 1 days 16:00:00
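As a cross-check, here is an alternative sketch that follows the question's wording more literally (last and first shift for each day). It is not the approach above: it aggregates per worker and day, then merges the result back. The helper name add_break_time and the small demo frame are hypothetical, and it assumes a shift belongs to the calendar day on which it starts.

```python
import pandas as pd


def add_break_time(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a per-day break_time column (hypothetical helper)."""
    out = df.copy()
    # Assumption: a shift is assigned to the calendar day on which it starts
    out["day"] = out["shift_start"].dt.normalize()
    # One row per worker and day: first start and last end on that day
    daily = out.groupby(["worker", "day"], as_index=False).agg(
        first_start=("shift_start", "min"),
        last_end=("shift_end", "max"),
    )
    # Last end of the previous worked day, shifted within each worker
    prev_end = daily.groupby("worker")["last_end"].shift()
    daily["break_time"] = daily["first_start"] - prev_end
    # Broadcast the per-day value back to every shift of that day
    out = out.merge(
        daily[["worker", "day", "break_time"]], on=["worker", "day"], how="left"
    )
    return out.drop(columns="day")


# Small demo frame taken from Bob's rows in the sample
demo = pd.DataFrame(
    {
        "worker": ["Bob", "Bob", "Bob"],
        "shift_start": pd.to_datetime(
            ["2022-01-15 15:30:00", "2022-01-17 10:30:00", "2022-01-17 16:30:00"]
        ),
        "shift_end": pd.to_datetime(
            ["2022-01-15 18:30:00", "2022-01-17 15:30:00", "2022-01-17 18:30:00"]
        ),
    }
)
result = add_break_time(demo)
print(result)
```

Note that merge returns a frame with a fresh integer index, so a custom index on df would need to be restored afterwards.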