Search code examples
pandasdataframegroup-by

Calculate the actual duration of successive or parallel task with Python Pandas


I have a pandas dataframe with many rows. In each row I have an object and the duration of the machining on a certain machine (with a start time and an end time). Each object can be processed in several machines in succession. I need to find the actual duration of all jobs. For example:

Object Machine T start T end
1 A 17:26 17:57
1 B 17:26 18:33
1 C 18:56 19:46
2 A 14:00 15:00
2 C 14:30 15:00
3 A 12:00 12:30
3 C 13:00 13:45

For object 1 the actual duration is 117 minutes,for object 2 is 60 minutes and for object 3 is 75 minutes. I tried with a groupby where I calculated the sum of the durations of the processes for each object and the minimum and maximum values, i.e. the first start and the last end. Then I wrote a function that compares these values ​​but it doesn't work in case of object 1, and it works for object 2 and 3. Here my solution:

Object min max sumT LT_ACTUAL
1 17:26 19:46 148 140 ERROR!
2 14:00 15:00 90 60 OK!
3 12:00 13:45 75 75 OK!
def calc_lead_time(min_t_start, max_t_end, t_sum):
    t_max_min = (max_t_end - min_t_start) / pd.Timedelta(minutes=1)
    if t_max_min <= t_sum:
        return t_max_min
    else:
        return t_sum
    
df['LT_ACTUAL'] = df.apply(lambda x : calc_lead_time(x['min'], x['max'], x['sumT']), axis=1)

Solution

  • Assuming the data is sorted by start time, and that one task duration is not fully within another one, you can use:

    start = pd.to_timedelta(df['T start']+':00')
    end = pd.to_timedelta(df['T end']+':00')
    
    s = start.groupby(df['Object']).shift(-1)
    
    (end.mask(end.gt(s), s).sub(start)
        .groupby(df['Object']).sum()
    )
    

    Output:

    Object
    1   0 days 01:57:00
    2   0 days 01:00:00
    3   0 days 01:15:00
    dtype: timedelta64[ns]
    

    For minutes:

    start = pd.to_timedelta(df['T start']+':00')
    end = pd.to_timedelta(df['T end']+':00')
    
    s = start.groupby(df['Object']).shift(-1)
    
    (end.mask(end.gt(s), s).sub(start)
        .groupby(df['Object']).sum()
        .dt.total_seconds().div(60)
    )
    

    Output:

    Object
    1    117.0
    2     60.0
    3     75.0
    dtype: float64
    

    handling overlapping intervals

    See here for the logic of the overlapping intervals grouping.

    (df.assign(
            start=pd.to_timedelta(df['T start']+':00'),
            end=pd.to_timedelta(df['T end']+':00'),
            max_end=lambda d: d.groupby('Object')['end'].cummax(),
            group=lambda d: d['start'].ge(d.groupby('Object')['max_end'].shift()).cumsum()
        )
       .groupby(['Object', 'group'])
       .apply(lambda g: g['end'].max()-g['start'].min())
       .groupby(level='Object').sum()
       .dt.total_seconds().div(60)
    )
    

    Output:

    Object
    1    117.0
    2     60.0
    3     75.0
    4     35.0
    dtype: float64
    

    Used input:

    
       Object Machine T start  T end
    0       1       A   17:26  17:57
    1       1       B   17:26  18:33
    2       1       C   18:56  19:46
    3       2       A   14:00  15:00
    4       2       C   14:30  15:00
    5       3       A   12:00  12:30
    6       3       C   13:00  13:45
    7       4       A   12:00  12:30
    8       4       C   12:00  12:15
    9       4       D   12:20  12:35