Search code examples
pythonpandasdataframepandas-groupbyseries

Apply multiple criteria to select current and prior row - Pandas


I have a dataframe like as shown below

person_id  source_system   r_diff
  1              O          NULL
  1              O           0
  1              O           9
  1              O           NULL
  2              O           574
  2              I           20
  2              O           135
  2              O           0
  2              I           21
  2              O           2
  2              O           0
  2              O           0
  2              I           12 

I would like to select rows based on the criteria below

criteria 1 - pick all rows where source-system = I

criteria 2 - pick prior row (n-1) only when source-system of (n-1)th is O and diff is zero.

This criteria 2 should be applied only when nth row has source-system = I. If source-system of (n-1)th is I, we don't have to do anything because criteria 1 will handle that.

We have to apply both the criteria each person

I tried the below based on SO suggestion but not sure how to make it work

m1 = df['visit_source_value'] == 'I'
m2 = df['diff'] <= 0
m3 = df.groupby('person_id')['diff'].shift(-1) <= 0 

df = df1[m1 | m2 | m3]

I expect my output to be like as shown below

  2              I           20
  2              O           0
  2              I           21
  2              O           0
  2              I           12 

Solution

  • I prefer not one line solution, because hard readable if more complicated code, so better is use:

    m1 = df['visit_source_value'] == 'I'
    m2 = df['r_diff'] <= 0
    m3 = df.groupby('person_id')['visit_source_value'].shift(-1) == 'I'
    
    df = df[m1 | (m2 & m3)]
    
    print (df)
        person_id visit_source_value  r_diff
    5           2                  I    20.0
    7           2                  O     0.0
    8           2                  I    21.0
    11          2                  O     0.0
    12          2                  I    12.0