Search code examples
pythonpandasdataframegroup-by

Count consecutive identical values in pandas dataframe and compute cumulative sum of occurrences


I have the following DataFrame:

df = 

sample_type    data_window
   A                1           
   A                1           
   A                1           
   B                1           
   B                1           
   B                8
   B                1           
   C                2           
   C                1   
   C                1           

I'm trying to count the maximum consecutive identical values =1 in data_window for each sample_type.

I'd like an output DataFrame like below:

df = 

sample_type    cum_count    
   A                3                     
   B                2            
   C                2 

I've tried an approach similar to this: groupby consecutive identical values in pandas dataframe and cumulative count of the number of occurences, using groupby and cumsum, but can't get it to work.

Any help would be greatly appreciated!


Solution

  • You can use groupby_agg with a custom function:

    max_consecutive_zeroes = lambda x: x.loc[x.eq(1)].groupby(x.ne(1).cumsum()).size().max()
    out = df.groupby('sample_type', as_index=False)['data_window'].agg(max_consecutive_zeroes)
    

    Output:

    >>> out
      sample_type  data_window
    0           A            3
    1           B            2
    2           C            2
    

    Example.

    Suppose the following series (only one group, so don't need the initial groupby_agg here):

    >>> x
    0    1  # First group (2)
    1    1  #
    2    6
    3    1  # Second group (1)
    4    7
    5    9
    6    1  # Third group (3, the longest)
    7    1  #
    8    1  #
    9    3
    dtype: int64
    

    Step by step:

    # Remove other rows where values are not 1
    >>> out = x.loc[x.eq(1)]
    0    1
    1    1
    3    1
    6    1
    7    1
    8    1
    dtype: int64
    
    # Create virtual groups based on values are not 1
    >>> out = out.groupby(x.ne(1).cumsum())
    [(0,  # First group
      0    1
      1    1
      dtype: int64),
     (1,  # Second group
      3    1
      dtype: int64),
     (3,  # Third group
      6    1
      7    1
      8    1
      dtype: int64)]
    
    # Get the size and keep the max
    >>> out.size().max()
    3