Search code examples
pythonpandasdataframeseriesflags

Pandas How to flag consecutive values ignoring the first occurrence


I have the following code:

data={'id':[1,2,3,4,5,6,7,8,9,10,11],
  'value':[1,0,1,0,1,1,1,0,0,1,0]}
  df=pd.DataFrame.from_dict(data)
  df
  Out[8]: 
      id  value
 0    1      1
 1    2      0
 2    3      1
 3    4      0
 4    5      1
 5    6      1
 6    7      1
 7    8      0
 8    9      0
 9    10     1
10    11     0

I want to create a flag column that indicate with 1 consecutive values starting from the second occurrence and ignoring the first.

With the actual solution:

df['flag'] = 
df.value.groupby([df.value,df.flag.diff().ne(0).cumsum()]).transform('size').ge(3).astype(int)


Out[8]: 
        id  value  flag
    0    1      1     0
    1    2      0     0
    2    3      1     0
    3    4      0     0
    4    5      1     1
    5    6      1     1
    6    7      1     1
    7    8      0     1
    8    9      0     1
    9    10     1     0
   10    11     0     0

While I need a solution like this, where the first occurence is flagged as 0 and 1 starting from the second:

Out[8]: 
        id  value  flag
    0    1      1    0
    1    2      0    0
    2    3      1    0
    3    4      0    0
    4    5      1    0
    5    6      1    1
    6    7      1    1
    7    8      0    0
    8    9      0    1
    9    10     1    0
   10    11     0    0

Solution

  • Create consecutive groups by compared Series.shifted values by not equal and Series.cumsum, create counter by GroupBy.cumcount and compare if greater values like 0 by Series.gt, last map True, False to 1, 0 by casting to integers by Series.astype:

    df['flag'] = (df.groupby(df['value'].ne(df['value'].shift()).cumsum())
                    .cumcount()
                    .gt(0)
                    .astype(int))
    print (df)
        id  value  flag
    0    1      1     0
    1    2      0     0
    2    3      1     0
    3    4      0     0
    4    5      1     0
    5    6      1     1
    6    7      1     1
    7    8      0     0
    8    9      0     1
    9   10      1     0
    10  11      0     0
    

    How it working:

    print (df.assign(g = df['value'].ne(df['value'].shift()).cumsum(),
                     counter = df.groupby(df['value'].ne(df['value'].shift()).cumsum()).cumcount(),
                     mask = df.groupby(df['value'].ne(df['value'].shift()).cumsum()).cumcount().gt(0)))
    
        id  value  g  counter   mask
    0    1      1  1        0  False
    1    2      0  2        0  False
    2    3      1  3        0  False
    3    4      0  4        0  False
    4    5      1  5        0  False
    5    6      1  5        1   True
    6    7      1  5        2   True
    7    8      0  6        0  False
    8    9      0  6        1   True
    9   10      1  7        0  False
    10  11      0  8        0  False