Search code examples
pythonpandasbooleancumsum

count sets of consecutive true values in a column


Let's say that I have a dataframe as follow:

df = pd.DataFrame({'A':[1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})

Then, I convert it into a boolean form:

df.eq(1)
Out[213]: 
        A
0    True
1    True
2    True
3    True
4    True
5   False
6   False
7    True
8    True
9   False
10   True
11   True
12   True
13   True
14   True
15  False
16  False
17  False
18  False
19  False
20   True
21   True

What I want is to count consecutive sets of True values in the column. In this example, the output would be:

    df
Out[215]: 
    A  count
0   1    5.0
1   1    2.0
2   1    5.0
3   1    2.0
4   1    NaN
5   0    NaN
6   0    NaN
7   1    NaN
8   1    NaN
9   0    NaN
10  1    NaN
11  1    NaN
12  1    NaN
13  1    NaN
14  1    NaN
15  0    NaN
16  0    NaN
17  0    NaN
18  0    NaN
19  0    NaN
20  1    NaN
21  1    NaN

My progress has been by using tools as 'groupby' and 'cumsum' but honestly, I can not figure out how to solve it. Thanks in advance


Solution

  • You can use df['A'].diff().ne(0).cumsum() to generate a grouper that will group each consecutive group of zeros/ones:

    # A side-by-side comparison:
    >>> pd.concat([df['A'], df['A'].diff().ne(0).cumsum()], axis=1)
        A  A
    0   1  1
    1   1  1
    2   1  1
    3   1  1
    4   1  1
    5   0  2
    6   0  2
    7   1  3
    8   1  3
    9   0  4
    10  1  5
    11  1  5
    12  1  5
    13  1  5
    14  1  5
    15  0  6
    16  0  6
    17  0  6
    18  0  6
    19  0  6
    20  1  7
    21  1  7
    

    Thus, group by that grouper, calculate sums, replace zero with NaN + dropna, and reset the index:

    df['count'] = df.groupby(df['A'].diff().ne(0).cumsum()).sum().replace(0, np.nan).dropna().reset_index(drop=True)
    

    Output:

    >>> df
        A    B
    0   1  5.0
    1   1  2.0
    2   1  5.0
    3   1  2.0
    4   1  NaN
    5   0  NaN
    6   0  NaN
    7   1  NaN
    8   1  NaN
    9   0  NaN
    10  1  NaN
    11  1  NaN
    12  1  NaN
    13  1  NaN
    14  1  NaN
    15  0  NaN
    16  0  NaN
    17  0  NaN
    18  0  NaN
    19  0  NaN
    20  1  NaN
    21  1  NaN