Search code examples
pythonpandasdataframevectorizationrunning-count

creating a column which keeps a running count of consecutive values


I am trying to create a column (“consec”) which will keep a running count of consecutive values in another (“binary”) without using loop. This is what the desired outcome would look like:

.    binary consec
1       0      0
2       1      1
3       1      2
4       1      3
5       1      4
5       0      0
6       1      1
7       1      2
8       0      0

However, this...

df['consec'][df['binary']==1] = df['consec'].shift(1) + df['binary']

results in this...

.  binary   consec
0     1       NaN
1     1       1
2     1       1
3     0       0
4     1       1
5     0       0
6     1       1
7     1       1
8     1       1
9     0       0

I see other posts which use grouping or sorting, but unfortunately, I don't see how that could work for me.


Solution

  • You can use the compare-cumsum-groupby pattern (which I really need to getting around to writing up for the documentation), with a final cumcount:

    >>> df = pd.DataFrame({"binary": [0,1,1,1,0,0,1,1,0]})
    >>> df["consec"] = df["binary"].groupby((df["binary"] == 0).cumsum()).cumcount()
    >>> df
       binary  consec
    0       0       0
    1       1       1
    2       1       2
    3       1       3
    4       0       0
    5       0       0
    6       1       1
    7       1       2
    8       0       0
    

    This works because first we get the positions where we want to reset the counter:

    >>> (df["binary"] == 0)
    0     True
    1    False
    2    False
    3    False
    4     True
    5     True
    6    False
    7    False
    8     True
    Name: binary, dtype: bool
    

    The cumulative sum of these gives us a different id for each group:

    >>> (df["binary"] == 0).cumsum()
    0    1
    1    1
    2    1
    3    1
    4    2
    5    3
    6    3
    7    3
    8    4
    Name: binary, dtype: int64
    

    And then we can pass this to groupby and use cumcount to get an increasing index in each group.