Search code examples
pythonpandasgroup-by

extract consecutive rows with similar values in a column more with a specific patch size


I was looking out to extract consecutive rows with specified text repeated continuously for more than 5 times.

ex:

 A  B   C 
10  john    1
12  paul    1
23  kishan  1
12  teja    1
12  zebo    1
324 vauh    -1
3434    krish   -1
232 poo -1
4535    zoo 1
4343    doo 1
342 foo -1
123 soo 1
121 koo -1
34  loo -1
343454  moo -1
565343  noo -1
2323234 voo -1
3434    coo 1
545 xoo 1
6565    zoo 1
232321  qoo 1
34454   woo 1
546556  eoo 1
65665   roo -1
5343    too -1
3232    yoo 1
1212    uoo 1
23355667    ioo 1
787878  joo -1

I am looking out for the below result where the column value 'c' has consecutive 1's repeated more than 4 times as different groups .

Output:

A   B   C   group
10  john    1   1
12  paul    1   1
23  kishan  1   1
12  teja    1   1
12  zebo    1   1
3434 coo    1   2
545  xoo    1   2
6565    zoo 1   2
232321  qoo 1   2
34454   woo 1   2
546556  eoo 1   2

Solution

  • Using masks and factorize:

    # identify 1s
    m = df['C'].eq(1)
    # group consecutive values
    g = m.ne(m.shift()).cumsum()
    # identify stretches of 5+ 1s
    m2 = m & df.groupby(g)['C'].transform('size').ge(5)
    
    out = (df.loc[m2]
             .assign(group=pd.factorize(g[m2])[0]+1)
           )
    

    Output:

             A       B  C  group
    0       10    john  1      1
    1       12    paul  1      1
    2       23  kishan  1      1
    3       12    teja  1      1
    4       12    zebo  1      1
    17    3434     coo  1      2
    18     545     xoo  1      2
    19    6565     zoo  1      2
    20  232321     qoo  1      2
    21   34454     woo  1      2
    22  546556     eoo  1      2