Search code examples
pythonpandasdataframepandas-groupby

Pandas: Creating indicator column after condition


import numpy as np
import pandas as pd
df = pd.DataFrame({
   'cond': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'B', 'B','B','B'],
   'Array':  ['S', 'S', 'TT', 'TT','S', 'S', 'TT', 'TT','S', 'S', 'TT', 'TT','S', 'S', 'TT', 'TT','SS','TT'],
   'Area': [3.0, 2.0, 2.88, 1.33,  2.44, 1.25, 1.53, 1.0, 0.156, 2.0, 2.4, 6.3, 6.9, 9.78, 10.2, 3.0, 16.0, 19.0]
})
print(df)

I am trying to make an indicator column that indicates if the area being a certain size has already happened. So for example, if the cond. is A, then I want to indicate the first time that the area is <=1.5 (and for all data points after), and if the cond. is B, then to indicate the first time the area >10 (and for all points after). The final result should look like:

   cond Array    Area   Indicator
0     A     S   3.000        0
1     A     S   2.000        0
2     A    TT   2.880        0
3     A    TT   1.330        1
4     A     S   2.440        1
5     A     S   1.250        1
6     A    TT   1.530        1
7     A    TT   1.000        1
8     A     S   0.156        1
9     B     S   2.000        0
10    B    TT   2.400        0
11    B    TT   6.300        0
12    B     S   6.900        0
13    B     S   9.780        0
14    B    TT  10.200        1
15    B    TT   3.000        1
16    B    SS  16.000        1
17    B    TT  19.000        1

A lot of the other examples I looked at were to indicate if the area for A was <=1.5, or indicate the first time it happens, but not to indicate the first time it happens AND indicate all the datapoints after. The idea is that once my condition hits a certain area, it enters a different "phase" and I'm trying to indicate when "A" enters and stays in that phase (and the equivalent for B).


Solution

  • You can make write the conditions, and then group by cond and use cumsum + clip:

    mask = (df['cond'].eq('A') & df['Area'].lt(1.5)) | (df['cond'].eq('B') & df['Area'].gt(10))
    df['Indicator'] = mask.groupby(df['cond']).cumsum().clip(0, 1)
    

    Output:

    >>> df
       cond Array    Area  Indicator
    0   A    S     3.000   0        
    1   A    S     2.000   0        
    2   A    TT    2.880   0        
    3   A    TT    1.330   1        
    4   A    S     2.440   1        
    5   A    S     1.250   1        
    6   A    TT    1.530   1        
    7   A    TT    1.000   1        
    8   A    S     0.156   1        
    9   B    S     2.000   0        
    10  B    TT    2.400   0        
    11  B    TT    6.300   0        
    12  B    S     6.900   0        
    13  B    S     9.780   0        
    14  B    TT    10.200  1        
    15  B    TT    3.000   1        
    16  B    SS    16.000  1        
    17  B    TT    19.000  1