I have a table like this:
id | tm | milestone |
---|---|---|
00335c06f96a21e4089c49a5da | 2023-02-01 18:13:42.307543 | A |
00335c06f96a21e4089c49a5da | 2023-02-01 18:14:42.307543 | A |
00335c06f96a21e4089c49a5da | 2023-02-01 18:15:42.307543 | A |
00335c06f96a21e4089c49a5da | 2023-02-01 18:19:10.307543 | B |
00335c06f96a21e4089c49a5da | 2023-02-01 18:21:05.307543 | C |
0043545f6b9112c7e471d5cc81 | 2023-02-02 08:06:42.307543 | A |
0043545f6b9112c7e471d5cc81 | 2023-02-02 08:07:42.307543 | A |
0043545f6b9112c7e471d5cc81 | 2023-02-02 09:05:42.307543 | B |
0043545f6b9112c7e471d5cc81 | 2023-02-02 09:05:42.307543 | B |
ffe92ae6b0962e800dbdbdf00a | 2023-02-12 19:05:42.307543 | A |
ffe92ae6b0962e800dbdbdf00a | 2023-02-12 19:05:42.307543 | B |
ffe92ae6b0962e800dbdbdf00a | 2023-02-12 19:07:42.307543 | B |
ffe92ae6b0962e800dbdbdf00a | 2023-02-13 21:03:42.307543 | C |
What I want to achieve is to convert it to a transactional / pivoted view, so that each id gets one column per milestone pair (A-B, A-C, A-D, ...) holding the time difference in minutes. For a start: columns A-B and A-C.
The tricky part is that there are multiple records with the same milestone, so I take the min time for each milestone. Also, not every id has the full sequence; in that case the time difference should be None.
For Example:
id | A-B, min | A-C, min |
---|---|---|
00335c06f96a21e4089c49a5da | 5.47 | 7.38 |
0043545f6b9112c7e471d5cc81 | 59.0 | None |
ffe92ae6b0962e800dbdbdf00a | 0 | 1558.0 |
What I'm trying to do: take the min value for each milestone:
df.groupby(['id', 'milestone'])['tm'].min().to_frame('tm').reset_index()
and then apply .diff(), like:
.sort_values(['id', 'tm']).groupby('id')['tm'].diff()
But I can't figure out how to properly turn this into the time-difference columns.
Another idea is to use .pivot, but I can't come up with a way to aggregate the milestones inside it.
Appreciate any help.
Input DataFrame for example:
df = pd.DataFrame({'id':['00335c06f96a21e4089c49a5da','00335c06f96a21e4089c49a5da','00335c06f96a21e4089c49a5da','00335c06f96a21e4089c49a5da','00335c06f96a21e4089c49a5da','0043545f6b9112c7e471d5cc81','0043545f6b9112c7e471d5cc81','0043545f6b9112c7e471d5cc81','0043545f6b9112c7e471d5cc81','ffe92ae6b0962e800dbdbdf00a','ffe92ae6b0962e800dbdbdf00a','ffe92ae6b0962e800dbdbdf00a','ffe92ae6b0962e800dbdbdf00a'],
'tm':['2023-02-01 18:13:42.307543','2023-02-01 18:14:42.307543','2023-02-01 18:15:42.307543','2023-02-01 18:19:10.307543','2023-02-01 18:21:05.307543',
'2023-02-02 08:06:42.307543','2023-02-02 08:07:42.307543','2023-02-02 09:05:42.307543','2023-02-02 09:05:42.307543',
'2023-02-12 19:05:42.307543','2023-02-12 19:05:42.307543','2023-02-12 19:07:42.307543','2023-02-13 21:03:42.307543'],
'milestone':['A','A','A','B','C','A','A','B','B','A','B','B','C']})
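For reference, the groupby-min idea sketched above can be completed without .diff() at all: unstack the min times into one column per milestone, then subtract the A column from the others. A sketch (the column labels 'A-B, min' / 'A-C, min' follow the expected output above):

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['00335c06f96a21e4089c49a5da'] * 5
          + ['0043545f6b9112c7e471d5cc81'] * 4
          + ['ffe92ae6b0962e800dbdbdf00a'] * 4,
    'tm': pd.to_datetime([
        '2023-02-01 18:13:42.307543', '2023-02-01 18:14:42.307543',
        '2023-02-01 18:15:42.307543', '2023-02-01 18:19:10.307543',
        '2023-02-01 18:21:05.307543',
        '2023-02-02 08:06:42.307543', '2023-02-02 08:07:42.307543',
        '2023-02-02 09:05:42.307543', '2023-02-02 09:05:42.307543',
        '2023-02-12 19:05:42.307543', '2023-02-12 19:05:42.307543',
        '2023-02-12 19:07:42.307543', '2023-02-13 21:03:42.307543',
    ]),
    'milestone': ['A', 'A', 'A', 'B', 'C', 'A', 'A', 'B', 'B',
                  'A', 'B', 'B', 'C'],
})

# min time per (id, milestone), one column per milestone; missing ones become NaT
wide = df.groupby(['id', 'milestone'])['tm'].min().unstack()

# subtract the first milestone's time from the others, convert to minutes;
# NaT propagates to NaN, which marks the incomplete sequences
out = (wide[['B', 'C']].sub(wide['A'], axis=0)
       .apply(lambda s: s.dt.total_seconds().div(60).round(2))
       .rename(columns={'B': 'A-B, min', 'C': 'A-C, min'})
       .rename_axis(columns=None).reset_index())
```

This reproduces the expected table, with NaN where an id never reaches a milestone.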
Update

Regarding "I am taking min time for each category": just change the aggregation function of pivot_table to 'first' (and use true division so the minutes keep their fractional part):
df['tm'] = pd.to_datetime(df['tm']) # maybe unnecessary
m = df['milestone'] == 'A' # first milestone or not
out = (pd.merge(df[m], df[~m], on='id')
.assign(milestone=lambda x: x.pop('milestone_x') + '-' + x.pop('milestone_y'),
tm=lambda x: x.pop('tm_y').sub(x.pop('tm_x')).dt.total_seconds() / 60)
.pivot_table(index='id', columns='milestone', values='tm', aggfunc='first')
.round(2).rename_axis(columns=None).reset_index())
Output:
>>> out
id A-B A-C
0 00335c06f96a21e4089c49a5da 5.47 7.38
1 0043545f6b9112c7e471d5cc81 59.00 NaN
2 ffe92ae6b0962e800dbdbdf00a 0.00 1558.00
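One caveat (my reading, not stated in the answer): aggfunc='first' keeps the earliest A-to-B pair only because the sample rows are already ordered by tm within each id. With unsorted input, sorting beforehand makes the result deterministic. A minimal sketch with made-up timestamps, where the two A rows arrive out of order:

```python
import pandas as pd

# hypothetical data: two 'A' rows logged out of order, one 'B' in between
df = pd.DataFrame({
    'id': ['x', 'x', 'x'],
    'tm': pd.to_datetime(['2023-02-01 10:10:00',
                          '2023-02-01 10:00:00',
                          '2023-02-01 10:05:00']),
    'milestone': ['A', 'A', 'B'],
})

def a_to_b_minutes(frame):
    """A-B difference in minutes for the pair that aggfunc='first' would keep."""
    m = frame['milestone'] == 'A'
    pairs = pd.merge(frame[m], frame[~m], on='id')
    return ((pairs['tm_y'] - pairs['tm_x']).dt.total_seconds() / 60).iloc[0]

a_to_b_minutes(df)                            # negative: picks the 10:10 'A' row
a_to_b_minutes(df.sort_values(['id', 'tm']))  # correct: picks the earliest 'A'
```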
IIUC, you want to compare the time between the first milestone (A) and the other ones, so you can use merge before reshaping your dataframe with pivot_table:
df['tm'] = pd.to_datetime(df['tm']) # maybe unnecessary
m = df['milestone'] == 'A' # first milestone or not
out = (pd.merge(df[m], df[~m], on='id')
.assign(milestone=lambda x: x.pop('milestone_x') + '-' + x.pop('milestone_y'),
tm=lambda x: x.pop('tm_y').sub(x.pop('tm_x')).dt.total_seconds() // 60)
.pivot_table(index='id', columns='milestone', values='tm', aggfunc=min)
.rename_axis(columns=None).reset_index())
Output:
>>> out
id A-B A-C
0 00335c06f96a21e4089c49a5da 3.0 5.0
1 0043545f6b9112c7e471d5cc81 58.0 NaN
2 ffe92ae6b0962e800dbdbdf00a 0.0 1558.0
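A side note on scaling: because the merged milestone_x + '-' + milestone_y strings become the pivot columns, the same code picks up new milestones (D, E, ...) without listing them anywhere. A quick check with a hypothetical id and made-up timestamps, including a 'D' row:

```python
import pandas as pd

# hypothetical input for illustration only: milestones A, B and D for one id
df = pd.DataFrame({
    'id': ['x', 'x', 'x'],
    'tm': pd.to_datetime(['2023-02-01 10:00:00',
                          '2023-02-01 10:05:00',
                          '2023-02-01 10:30:00']),
    'milestone': ['A', 'B', 'D'],
})

m = df['milestone'] == 'A'  # first milestone or not
out = (pd.merge(df[m], df[~m], on='id')
       .assign(milestone=lambda x: x.pop('milestone_x') + '-' + x.pop('milestone_y'),
               tm=lambda x: x.pop('tm_y').sub(x.pop('tm_x')).dt.total_seconds() / 60)
       .pivot_table(index='id', columns='milestone', values='tm', aggfunc='first')
       .rename_axis(columns=None).reset_index())
# the 'A-D' column appears automatically, no code change needed
```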