Search code examples
pythonpandasgroup-bypivot

Pandas: calculate time difference between different milestones in column


I have a table like this:

id tm milestone
00335c06f96a21e4089c49a5da 2023-02-01 18:13:42.307543 A
00335c06f96a21e4089c49a5da 2023-02-01 18:14:42.307543 A
00335c06f96a21e4089c49a5da 2023-02-01 18:15:42.307543 A
00335c06f96a21e4089c49a5da 2023-02-01 18:19:10.307543 B
00335c06f96a21e4089c49a5da 2023-02-01 18:21:05.307543 C
0043545f6b9112c7e471d5cc81 2023-02-02 08:06:42.307543 A
0043545f6b9112c7e471d5cc81 2023-02-02 08:07:42.307543 A
0043545f6b9112c7e471d5cc81 2023-02-02 09:05:42.307543 B
0043545f6b9112c7e471d5cc81 2023-02-02 09:05:42.307543 B
ffe92ae6b0962e800dbdbdf00a 2023-02-12 19:05:42.307543 A
ffe92ae6b0962e800dbdbdf00a 2023-02-12 19:05:42.307543 B
ffe92ae6b0962e800dbdbdf00a 2023-02-12 19:07:42.307543 B
ffe92ae6b0962e800dbdbdf00a 2023-02-13 21:03:42.307543 C

What I want to achieve - is convert it to transactional view / pivot so that for each id to have columns with max time difference between milestones (A, B, C, D..). For starting - columns A-B, A-C.
Treaky part that there are multiple records with same milestone - I am taking min time for each category. But also not all id have full sequence, in this case I need to mark this time difference as None.

For Example:

id A-B, min A-C, min
00335c06f96a21e4089c49a5da 5.47 7.38
0043545f6b9112c7e471d5cc81 59.0 None
ffe92ae6b0962e800dbdbdf00a 0 1558.0

What I'm trying to do:
Take min values for each milestone: df.groupby(['id','milestone '])['tm'].min().to_frame('tm').reset_index(),

And apply .diff() like .sort_values(['id','tm']).groupby('id')['tm'].diff() But can't feagure out how to properly do this to create time-difference columns.
Other idea is to use .pivot but can't come up with idea how to aggregate milestones inside.

Appreciate any help.

Input DataFrame for example:

pd.DataFrame({'id':['00335c06f96a21e4089c49a5da','00335c06f96a21e4089c49a5da','00335c06f96a21e4089c49a5da','00335c06f96a21e4089c49a5da','00335c06f96a21e4089c49a5da','0043545f6b9112c7e471d5cc81','0043545f6b9112c7e471d5cc81','0043545f6b9112c7e471d5cc81','0043545f6b9112c7e471d5cc81','ffe92ae6b0962e800dbdbdf00a','ffe92ae6b0962e800dbdbdf00a','ffe92ae6b0962e800dbdbdf00a','ffe92ae6b0962e800dbdbdf00a'],
             'tm':['2023-02-01 18:13:42.307543','2023-02-01 18:14:42.307543','2023-02-01 18:15:42.307543','2023-02-01 18:19:10.307543','2023-02-01 18:21:05.307543',
                  '2023-02-02 08:06:42.307543','2023-02-02 08:07:42.307543','2023-02-02 09:05:42.307543','2023-02-02 09:05:42.307543',
                  '2023-02-12 19:05:42.307543','2023-02-12 19:05:42.307543','2023-02-12 19:07:42.307543','2023-02-13 21:03:42.307543'],
              'milestone':['A','A','A','B','C','A','A','B','B','A','B','B','C']})

Solution

  • Update

    I am taking min time for each category

    Just change the aggregation function of pivot_table:

    df['tm'] = pd.to_datetime(df['tm'])  # maybe unnecessary
    
    m = df['milestone'] == 'A'  # first milestone or not
    
    out = (pd.merge(df[m], df[~m], on='id')
             .assign(milestone=lambda x: x.pop('milestone_x') + '-' + x.pop('milestone_y'),
                     tm=lambda x: x.pop('tm_y').sub(x.pop('tm_x')).dt.total_seconds() / 60)
             .pivot_table(index='id', columns='milestone', values='tm', aggfunc='first')
             .round(2).rename_axis(columns=None).reset_index())
    

    Output:

    >>> out
                               id    A-B      A-C
    0  00335c06f96a21e4089c49a5da   5.47     7.38
    1  0043545f6b9112c7e471d5cc81  59.00      NaN
    2  ffe92ae6b0962e800dbdbdf00a   0.00  1558.00
    

    IIUC, you want to compare time between the first milestone (A) and other ones, so you can use merge before reshaping your dataframe with pivot_table:

    df['tm'] = pd.to_datetime(df['tm'])  # maybe unnecessary
    
    m = df['milestone'] == 'A'  # first milestone or not
    
    out = (pd.merge(df[m], df[~m], on='id')
             .assign(milestone=lambda x: x.pop('milestone_x') + '-' + x.pop('milestone_y'),
                     tm=lambda x: x.pop('tm_y').sub(x.pop('tm_x')).dt.total_seconds() // 60)
             .pivot_table(index='id', columns='milestone', values='tm', aggfunc=min)
             .rename_axis(columns=None).reset_index())
    

    Output:

    >>> out
                               id   A-B     A-C
    0  00335c06f96a21e4089c49a5da   3.0     5.0
    1  0043545f6b9112c7e471d5cc81  58.0     NaN
    2  ffe92ae6b0962e800dbdbdf00a   0.0  1558.0