Search code examples
pythonpandasseries

Determine if a set of values appears across columns once, then one value from those columns never appears again in the df


I have a dataframe like this:

         DATE  ID  NUMERIC  UNIT
0  2021-01-01   1       55  psig
1  2019-01-01   1       20  psig
2  2018-01-01   2       55  psig
3  2017-01-01   3       40  psig
4  2016-01-01   4       55  cfpd

I am interested in determining if a row meets conditions across the 'NUMERIC', and 'UNIT' columns and if it does, if that 'ID' does not show up later in the dataframe. The 'DATE' field is sorted descending to determine 'later'.

The conditions are df['NUMERIC']>50 & df['UNIT']=='psig' If an 'ID' meets those conditions and the 'ID' doesn't show up again, it should boolean True. If it meets or doesn't meet the conditions and shows up later in the df, it should be False. I would like this boolean to be captured in a column named 'FAIL'.

Using the sample df, I would like the resulting df to look like:

         DATE  ID  NUMERIC  UNIT   FAIL
0  2021-01-01   1       55  psig  False
1  2019-01-01   1       20  psig  False
2  2018-01-01   2       55  psig   True
3  2017-01-01   3       40  psig  False
4  2016-01-01   4       55  cfpd  False

For context, this df shows inspections over time, if the inspection result is >50 psig, then it fails, but only if it is never inspected again (no follow-up)

I was able to determine if an 'ID' met those conditions in consecutive instances with:

df1 = df.copy()

df1 = df.loc[df['UNIT'] == 'psig']

c1 = df1['NUMERIC'].gt(50) & df1.groupby('ID')['NUMERIC'].shift(-1).gt(50)
c2 = df1['NUMERIC'].gt(50) & df1.groupby('ID')['NUMERIC'].shift().gt(50)

df1 = df1[c1 | c2]

df1['FAIL'] = 'True'

For this new case, I can't filter to only those rows where 'UNIT' is 'psig' because I need to see if the 'ID' shows up at all regardless of the unit. I don't think this method is going to work.

I also explored this question/answer, but it wasn't quite what I was looking for either.


Solution

  • Simply use your two conditions and ensure that there is not the same ID after the current row with duplicated:

    df['FAIL'] = (df['NUMERIC'].gt(50)               # is NUMERIC > 50
                  & df['UNIT'].eq('psig')            # and UNIT == 'psig'
                  & ~df['ID'].duplicated(keep=False) # and last ID
                 )
    

    Output:

             DATE  ID  NUMERIC  UNIT   FAIL
    0  2021-01-01   1       55  psig  False
    1  2019-01-01   1       20  psig  False
    2  2018-01-01   2       55  psig   True
    3  2017-01-01   3       40  psig  False
    4  2016-01-01   4       55  cfpd  False
    

    Intermediates:

             DATE  ID  NUMERIC  UNIT   FAIL  gt(50)  eq('psig')  ~duplicated
    0  2021-01-01   1       55  psig  False    True        True        False
    1  2019-01-01   1       20  psig  False   False        True        False
    2  2018-01-01   2       55  psig   True    True        True         True
    3  2017-01-01   3       40  psig  False   False        True         True
    4  2016-01-01   4       55  cfpd  False    True       False         True