The following is the original dataframe:
uid timestamp state
1 2015-01-01 fail
2 2015-01-07 fail
2 2015-03-02 fail
1 2015-01-03 pass
1 2015-01-02 warn
2 2015-03-01 pass
1 2015-01-04 pass
1 2015-01-07 pass
2 2015-01-01 warn
This is the resulting dataframe that I would like to generate:
uid timestamp state fail->pass?
1 2015-01-01 fail True
2 2015-01-07 pass False
2 2015-03-02 fail False
1 2015-01-03 pass True
1 2015-01-02 warn True
2 2015-03-01 pass False
1 2015-01-04 pass True
1 2015-01-07 pass True
2 2015-01-01 warn False
The "Fail->Pass?" column is a boolean column that tells you if UID went from a failing state to a passing state. This passing state must be the final state of the UID. The falling state can occur at anytime before the final state. The final state occurs at the latest timestamp for that UID.
What's the most efficient way to create this column? There could potentially be hundreds of state transitions for each UID.
df = pd.DataFrame({'uid': [1, 2, 2, 1, 1, 2, 1, 1, 2],
'timestamp': ['2015-01-01',
'2015-01-07',
'2015-03-02',
'2015-01-03',
'2015-01-02',
'2015-03-01',
'2015-01-04',
'2015-01-07',
'2015-01-01'],
'state': ['fail',
'pass',
'fail',
'pass',
'warn',
'pass',
'pass',
'pass',
'warn'],
'fail->pass?': [True, False, False, True, True, False, True, True, False]})
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values(by='timestamp')
fp = (df[['uid','state']].groupby('uid').last()=='pass').reset_index()
fp.columns = ['uid','fail->pass?']
df.merge(fp, on='uid').sort_values(by='timestamp')
Output
uid timestamp state fail->pass?
0 1 2015-01-01 fail True
5 2 2015-01-01 warn False
1 1 2015-01-02 warn True
2 1 2015-01-03 pass True
3 1 2015-01-04 pass True
4 1 2015-01-07 pass True
6 2 2015-01-07 fail False
7 2 2015-03-01 pass False
8 2 2015-03-02 fail False