Pandas shift to compute time breaks in table


I'm dealing with a table with this format:

Worker_ID start_time end_time
Alice 2023-02-14 09:12:00 2023-02-14 09:26:00
Alice 2023-02-14 09:27:00 2023-02-14 10:22:00
Alice 2023-02-14 11:04:00 2023-02-14 11:28:00
Alice 2023-02-14 11:33:00 2023-02-14 11:54:00
Bob 2023-02-15 08:11:00 2023-02-15 08:42:00
Bob 2023-02-15 09:02:00 2023-02-15 09:31:00
Bob 2023-02-15 09:32:00 2023-02-15 09:35:00
Bob 2023-02-15 09:38:00 2023-02-15 09:58:00
Bob 2023-02-15 10:32:00 2023-02-15 10:49:00

That is, for each worker, the table records the start_time and end_time of shifts on machines. I need to compute the time since the previous break, where a gap only counts as a break if the worker spends more than 15 minutes offline. Otherwise, it is not considered a break. Additionally, if there is no previous data for the day, the time is counted from the first start_time of the day. So, for each row, I need a column indicating the minutes since the previous break under those conditions:

Worker_ID start_time end_time minutes_since_previous_break
Alice 2023-02-14 09:12:00 2023-02-14 09:26:00 0
Alice 2023-02-14 09:27:00 2023-02-14 10:22:00 15
Alice 2023-02-14 11:04:00 2023-02-14 11:28:00 0
Alice 2023-02-14 11:33:00 2023-02-14 11:54:00 29
Bob 2023-02-15 08:11:00 2023-02-15 08:42:00 0
Bob 2023-02-15 09:02:00 2023-02-15 09:31:00 0
Bob 2023-02-15 09:32:00 2023-02-15 09:35:00 30
Bob 2023-02-15 09:38:00 2023-02-15 09:58:00 36
Bob 2023-02-15 10:32:00 2023-02-15 10:49:00 0

In the case of Alice, the first row is the first of the day, so we assign 0 minutes since the last break. The second row starts just one minute after the previous shift finished, so that gap doesn't count as a break. Therefore, we accumulate the time since the previous break, which gives 15 minutes. The third row starts 42 minutes after the previous shift ended, so the count restarts and we assign 0 minutes since the last break. The last Alice row follows a 5-minute gap, which doesn't count as a break, so the time keeps accumulating and reaches 29 minutes.

For Bob the logic is the same. So far I have only managed this:

df = df.sort_values(by=['Worker_ID', 'start_time'])
df['minutes_between_shifts'] = (df['start_time'] - df['end_time'].shift()).dt.total_seconds()/60
df = df[df['minutes_between_shifts']>=15]
df['minutes_since_last_break'] = 0

That is, I compute the minutes between consecutive shifts, and if the gap is at least 15 minutes there has been a valid break, so I assign the value 0. But I can't get my head around how to apply the two remaining conditions. I tried the opposite approach, making a copy of the df with the cases where minutes_between_shifts < 15, but that doesn't solve it. Any ideas are very welcome.


Solution

  • Example Code

    import pandas as pd
    data1 = {'Worker_ID': ['Alice', 'Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob', 'Bob', 'Bob'], 
             'start_time': ['2023-02-14 09:12:00', '2023-02-14 09:27:00', '2023-02-14 11:04:00', 
                            '2023-02-14 11:33:00', '2023-02-15 08:11:00', '2023-02-15 09:02:00', 
                            '2023-02-15 09:32:00', '2023-02-15 09:38:00', '2023-02-15 10:32:00'], 
             'end_time': ['2023-02-14 09:26:00', '2023-02-14 10:22:00', '2023-02-14 11:28:00', 
                          '2023-02-14 11:54:00', '2023-02-15 08:42:00', '2023-02-15 09:31:00', 
                          '2023-02-15 09:35:00', '2023-02-15 09:58:00', '2023-02-15 10:49:00']}
    df = pd.DataFrame(data1)
    

    Step1

    First, convert both time columns to datetime dtype:

    df['start_time'] = pd.to_datetime(df['start_time'])
    df['end_time'] = pd.to_datetime(df['end_time'])
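
The grouped shift() and diff() calls in the next step compare each row with the row directly above it for the same worker, so they assume the rows are ordered by start_time within each worker, as in the question's sort_values call. A minimal sketch of converting and sorting together, using a two-row toy frame made up purely for illustration:

```python
import pandas as pd

# Toy frame (illustrative only), deliberately out of chronological order.
toy = pd.DataFrame({
    'Worker_ID': ['Alice', 'Alice'],
    'start_time': ['2023-02-14 09:27:00', '2023-02-14 09:12:00'],
    'end_time': ['2023-02-14 10:22:00', '2023-02-14 09:26:00'],
})

# Convert both string columns to datetime64 in one pass.
cols = ['start_time', 'end_time']
toy[cols] = toy[cols].apply(pd.to_datetime)

# Order rows by worker, then chronologically, and renumber the index
# so that it matches the new row order.
toy = toy.sort_values(['Worker_ID', 'start_time']).reset_index(drop=True)
```

The example data in this answer is already sorted, so the sort only matters for real tables where rows may arrive in any order.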
    

    Step2

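    Next, create a Series s that is 0 when a row follows a valid break (a gap of more than 15 minutes) or is the worker's first row, and otherwise holds the minutes between the previous start time and the current start time. Every 0 in s marks the start of a new block, so a per-worker running count of the zeros gives a block label; grouping by Worker_ID and that label and applying cumsum() to s yields the minutes since the previous break.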

    g = df.groupby('Worker_ID')
    cond1 = g['end_time'].shift().rsub(df['start_time']) <= pd.Timedelta('15min')
    s = df['start_time'].mask(~cond1, 0).mask(cond1, g['start_time'].diff().div(pd.Timedelta('1min'))).astype('int')
    df['minutes_since_previous_break'] = s.groupby([df['Worker_ID'], s.eq(0).groupby(df['Worker_ID']).cumsum()]).cumsum()
    

    Because the table contains several workers, every grouping also includes Worker_ID, so that gaps and running totals never carry over from one worker to the next.

    df

      Worker_ID          start_time            end_time  minutes_since_previous_break
    0     Alice 2023-02-14 09:12:00 2023-02-14 09:26:00                             0
    1     Alice 2023-02-14 09:27:00 2023-02-14 10:22:00                            15
    2     Alice 2023-02-14 11:04:00 2023-02-14 11:28:00                             0
    3     Alice 2023-02-14 11:33:00 2023-02-14 11:54:00                            29
    4       Bob 2023-02-15 08:11:00 2023-02-15 08:42:00                             0
    5       Bob 2023-02-15 09:02:00 2023-02-15 09:31:00                             0
    6       Bob 2023-02-15 09:32:00 2023-02-15 09:35:00                            30
    7       Bob 2023-02-15 09:38:00 2023-02-15 09:58:00                            36
    8       Bob 2023-02-15 10:32:00 2023-02-15 10:49:00                             0
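
The same logic can also be written without mixing datetimes and integers in one Series. The following is only a sketch of an equivalent formulation, not the code that produced the output above: the function name minutes_since_previous_break, its threshold parameter, and the intermediate names gap, no_break, step and block are all illustrative. It assumes the rows are already sorted by start_time within each worker.

```python
import pandas as pd

def minutes_since_previous_break(df, threshold='15min'):
    """Hypothetical helper: minutes accumulated since the last valid break."""
    g = df.groupby('Worker_ID')
    # Offline time between the previous shift's end and this shift's start (NaT on first rows).
    gap = df['start_time'] - g['end_time'].shift()
    # True when the gap is too short to be a break; NaT compares as False.
    no_break = gap <= pd.Timedelta(threshold)
    # Minutes between consecutive start times, kept only where no break happened.
    step = g['start_time'].diff().div(pd.Timedelta('1min'))
    s = step.where(no_break, 0)
    # Each break (or first row) opens a new block within the worker.
    block = (~no_break).groupby(df['Worker_ID']).cumsum()
    return s.groupby([df['Worker_ID'], block]).cumsum().astype('int')

# The example data from this answer.
df = pd.DataFrame({
    'Worker_ID': ['Alice'] * 4 + ['Bob'] * 5,
    'start_time': pd.to_datetime([
        '2023-02-14 09:12:00', '2023-02-14 09:27:00', '2023-02-14 11:04:00',
        '2023-02-14 11:33:00', '2023-02-15 08:11:00', '2023-02-15 09:02:00',
        '2023-02-15 09:32:00', '2023-02-15 09:38:00', '2023-02-15 10:32:00']),
    'end_time': pd.to_datetime([
        '2023-02-14 09:26:00', '2023-02-14 10:22:00', '2023-02-14 11:28:00',
        '2023-02-14 11:54:00', '2023-02-15 08:42:00', '2023-02-15 09:31:00',
        '2023-02-15 09:35:00', '2023-02-15 09:58:00', '2023-02-15 10:49:00']),
})
df['minutes_since_previous_break'] = minutes_since_previous_break(df)
```

One design difference: the block label here is derived from the break flag rather than from s.eq(0), so two shifts of the same worker that happen to share a start time would not be mistaken for a break.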
    

    Intermediate

        cond1   s   s.eq(0).groupby(df['Worker_ID']).cumsum()
    0   False   0   1
    1   True    15  1
    2   False   0   2
    3   True    29  2
    4   False   0   1
    5   False   0   2
    6   True    30  2
    7   True    6   2
    8   False   0   3
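
To inspect these intermediate values on your own data, the pieces can be collected into one frame. A small sketch using only Bob's rows; the frame name inter and the columns gap_min and block are illustrative, and s is built here with where() as an assumed equivalent of the mask() chain in Step2:

```python
import pandas as pd

# Bob's rows from the example data.
bob = pd.DataFrame({
    'Worker_ID': ['Bob'] * 5,
    'start_time': pd.to_datetime([
        '2023-02-15 08:11:00', '2023-02-15 09:02:00', '2023-02-15 09:32:00',
        '2023-02-15 09:38:00', '2023-02-15 10:32:00']),
    'end_time': pd.to_datetime([
        '2023-02-15 08:42:00', '2023-02-15 09:31:00', '2023-02-15 09:35:00',
        '2023-02-15 09:58:00', '2023-02-15 10:49:00']),
})

g = bob.groupby('Worker_ID')
# Offline time before each shift; NaT for the worker's first row.
gap = bob['start_time'] - g['end_time'].shift()
# True when the gap is 15 minutes or less, i.e. not a valid break.
cond1 = gap <= pd.Timedelta('15min')
# Minutes between consecutive start times, or 0 after a break or on the first row.
s = g['start_time'].diff().div(pd.Timedelta('1min')).where(cond1, 0).astype('int')

inter = pd.DataFrame({
    'gap_min': gap.div(pd.Timedelta('1min')),  # offline minutes before the shift
    'cond1': cond1,
    's': s,
    'block': s.eq(0).groupby(bob['Worker_ID']).cumsum(),  # per-worker block label
})
```

Reading the rows for Bob: the 20-minute and 34-minute gaps are breaks and reset s to 0, while the 1-minute and 3-minute gaps contribute 30 and 6 minutes, which cumsum() turns into 30 and 36 within the same block.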