I have a dataframe like this:
DATE ID NUMERIC UNIT
0 2021-01-01 1 55 psig
1 2019-01-01 1 20 psig
2 2018-01-01 2 55 psig
3 2017-01-01 3 40 psig
4 2016-01-01 4 55 cfpd
I am interested in determining if a row meets conditions across the 'NUMERIC', and 'UNIT' columns and if it does, if that 'ID' does not show up later in the dataframe. The 'DATE' field is sorted descending to determine 'later'.
The conditions are df['NUMERIC']>50 & df['UNIT']=='psig'
If an 'ID' meets those conditions and the 'ID' doesn't show up again, it should boolean True.
If it meets or doesn't meet the conditions and shows up later in the df, it should be False. I would like this boolean to be captured in a column named 'FAIL'.
Using the sample df, I would like the resulting df to look like:
DATE ID NUMERIC UNIT FAIL
0 2021-01-01 1 55 psig False
1 2019-01-01 1 20 psig False
2 2018-01-01 2 55 psig True
3 2017-01-01 3 40 psig False
4 2016-01-01 4 55 cfpd False
For context, this df shows inspections over time, if the inspection result is >50 psig, then it fails, but only if it is never inspected again (no follow-up)
I was able to determine if an 'ID' met those conditions in consecutive instances with:
df1 = df.copy()
df1 = df.loc[df['UNIT'] == 'psig']
c1 = df1['NUMERIC'].gt(50) & df1.groupby('ID')['NUMERIC'].shift(-1).gt(50)
c2 = df1['NUMERIC'].gt(50) & df1.groupby('ID')['NUMERIC'].shift().gt(50)
df1 = df1[c1 | c2]
df1['FAIL'] = 'True'
For this new case, I can't filter to only those rows where 'UNIT' is 'psig' because I need to see if the 'ID' shows up at all regardless of the unit. I don't think this method is going to work.
I also explored this question/answer, but it wasn't quite what I was looking for either.
Simply use your two conditions and ensure that there is not the same ID after the current row with duplicated
:
df['FAIL'] = (df['NUMERIC'].gt(50) # is NUMERIC > 50
& df['UNIT'].eq('psig') # and UNIT == 'psig'
& ~df['ID'].duplicated(keep=False) # and last ID
)
Output:
DATE ID NUMERIC UNIT FAIL
0 2021-01-01 1 55 psig False
1 2019-01-01 1 20 psig False
2 2018-01-01 2 55 psig True
3 2017-01-01 3 40 psig False
4 2016-01-01 4 55 cfpd False
Intermediates:
DATE ID NUMERIC UNIT FAIL gt(50) eq('psig') ~duplicated
0 2021-01-01 1 55 psig False True True False
1 2019-01-01 1 20 psig False False True False
2 2018-01-01 2 55 psig True True True True
3 2017-01-01 3 40 psig False False True True
4 2016-01-01 4 55 cfpd False True False True