Search code examples
pythonpandasdataframegroup-byrunning-count

Pandas: number rows within group cumulatively and across another group


Given the following dataframe:

    col_1 col_2 col_3
0     1     A     1
1     1     B     1
2     2     A     3
3     2     A     3
4     2     A     3
5     2     B     3
6     2     B     3
7     2     B     3
8     3     A     2
9     3     A     2
10    3     C     2
11    3     C     2

I need to create a new column in which the rows are numbered cumulatively within each group formed by 'col_1' and 'col_2', but also cumulatively after each group of 'col_1', like this:

    col_1 col_2 col_3  new
0     1     A     1     1
1     1     B     1     1
2     2     A     3     2
3     2     A     3     3
4     2     A     3     4
5     2     B     3     2
6     2     B     3     3
7     2     B     3     4
8     3     A     2     5
9     3     A     2     6
10    3     C     2     5
11    3     C     2     6

I've tried:

df['new'] = df.groupby(['col_1', 'col_2']).cumcount() + 1

But this doesn't add up from the previous group as intended.


Solution

  • This is a tricky problem. You want to calculate the cumcount within group, but for all subsequent groups you need to keep track of how much that was already incremented so you know the offset to apply. That can be done with a max + cumsum of this cumcount over the previous groups. Here the only complication is that you need to determine the relationship between previous and subsequent group labels, in case there isn't some simple + 1 increment between labels of susbequent groups.

    # Cumcount within group
    s = df.groupby(['col_1', 'col_2']).cumcount()
    
    # Determine how many cumcounts were within all previous groups of `col_1' 
    to_merge = s.add(1).groupby(df['col_1']).max().cumsum().add(1).to_frame('new')
    
    # Link group with prior group label
    df1 = df[['col_1']].drop_duplicates()
    df1['col_1_shift'] = df1['col_1'].shift(-1)
    df1 = pd.concat([to_merge, df1.set_index('col_1')], axis=1)
    
    # Bring the group offset over
    df = df.merge(df1, left_on='col_1', right_on='col_1_shift', how='left')
    
    # Add the group offset to the cumulative count within group.
    # First group (no previous group) is NaN so fill with 1.
    df['new'] = df['new'].fillna(1, downcast='infer') + s
    
    # Clean up merging column
    df = df.drop(columns='col_1_shift')
    

        col_1 col_2  col_3  new
    0       1     A      1    1
    1       1     B      1    1
    2       2     A      3    2
    3       2     A      3    3
    4       2     A      3    4
    5       2     B      3    2
    6       2     B      3    3
    7       2     B      3    4
    8       3     A      2    5
    9       3     A      2    6
    10      3     C      2    5
    11      3     C      2    6