python, pandas, grouping

Counting groups of like values within the same column


I have a dataframe whose columns contain groups of 1's and -1's. Each column can contain any number of 1's or -1's grouped together, and each column may also begin with an unspecified number of NaN's. I would like to count the total number of values within each group, excluding the NaN's.

An example dataframe is below:

import pandas as pd

NaN = float('nan')
df = pd.DataFrame({'col1':[1,1,1,1,-1,-1,-1,1,1],
                   'col2':[NaN, NaN, NaN, 1,1,-1,-1,-1,-1],
                   'col3':[NaN,NaN,NaN,NaN,NaN,1,1,1,-1]})

I would like the output to be another dataframe as shown below:

df = pd.DataFrame({'col1':[4,4,4,4,3,3,3,2,2],
                   'col2':[NaN, NaN, NaN,2,2,4,4,4,4],
                   'col3':[NaN,NaN,NaN,NaN,NaN,3,3,3,1]})

Solution

  • Assuming a DataFrame/Series, you can group on runs of successive identical values and use groupby.transform('size'):

    import pandas as pd

    df = pd.DataFrame({'col': [1,1,1,1,1,-1,-1,-1,-1,1,1,1]})

    # ne(shift()) is True wherever the value changes; cumsum() turns that into one id per run
    df['out'] = (df.groupby(df['col'].ne(df['col'].shift()).cumsum())['col']
                   .transform('size')
                )
    

    Output:

        col  out
    0     1    5
    1     1    5
    2     1    5
    3     1    5
    4     1    5
    5    -1    4
    6    -1    4
    7    -1    4
    8    -1    4
    9     1    3
    10    1    3
    11    1    3
    

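    To see why this works, it helps to inspect the intermediate grouping key: ne(shift()) is True at the start of every new run of identical values, and cumsum() turns those flags into a distinct id per run. A quick check, continuing from the df above (group_id is just a throwaway name here):

    # run identifiers used as the groupby key
    group_id = df['col'].ne(df['col'].shift()).cumsum()
    print(group_id.tolist())
    # [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
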
    If your input is a list, there's no need for pandas; go with itertools.groupby/itertools.chain:

    from itertools import groupby, chain

    lst = [1,1,1,1,1,-1,-1,-1,-1,1,1,1]
    # for each run, compute its length once (walrus operator) and repeat it that many times
    out = list(chain.from_iterable([l := len(list(g))] * l for _, g in groupby(lst)))
    

    Output:

    [5, 5, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3]
    
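    If the walrus operator is not available (it requires Python 3.8+), the same idea works with a plain loop, reusing lst and groupby from above; this is just an equivalent sketch, not a different algorithm:

    out = []
    for _, g in groupby(lst):
        n = len(list(g))     # size of the current run
        out.extend([n] * n)  # repeat it once per element of the run
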
    Multiple columns

    If you have a DataFrame with multiple columns, you can transform each column independently with apply and mask the NaNs:

    import pandas as pd

    NaN = float('nan')
    df = pd.DataFrame({'col1':[1,1,1,1,-1,-1,-1,1,1],
                       'col2':[NaN, NaN, NaN, 1,1,-1,-1,-1,-1],
                       'col3':[NaN,NaN,NaN,NaN,NaN,1,1,1,-1]})

    # NaN != NaN, so every NaN starts its own group and never inflates a run's count;
    # mask(s.isna()) then puts NaN back in the output at those positions
    out = df.apply(lambda s: s.groupby(s.ne(s.shift()).cumsum())
                              .transform('size').mask(s.isna()))
    

    Output:

       col1  col2  col3
    0     4   NaN   NaN
    1     4   NaN   NaN
    2     4   NaN   NaN
    3     4   2.0   NaN
    4     3   2.0   NaN
    5     3   4.0   3.0
    6     3   4.0   3.0
    7     2   4.0   3.0
    8     2   4.0   1.0
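
    Note that the columns containing NaN are upcast to float (hence the 2.0/4.0). If you prefer integer counts while keeping the missing values, one option (assuming a pandas version with nullable integer dtypes, 0.24+) is to convert the result to Int64:

    # optional: nullable integer dtype keeps whole-number counts alongside missing values
    out = out.astype('Int64')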