Search code examples
pythonpandaspandas-groupbycumsum

Cumulative sum in pandas starting with a zero and ending with the sum of all but the last entry respecting groups


In the dataframe below, I want to create a new column C which will be the cumulative sum of B for each group in the A column, but these sums must start at zero and only add the values until the penultimate entry for that group.

  A B
0 1 5
1 1 6
2 2 3
3 2 4
4 2 5
5 3 2
5 3 7
6 4 3

So, my result should be:

  A B C
0 1 5 0
1 1 6 5
2 2 3 0
3 2 4 3
4 2 5 7
5 3 2 0
5 3 7 2
6 4 3 0

(I think this question is really obvious but somehow I couldn't figure it out myself nor I could see it anywhere asked already.)


Solution

  • Another option is to use .groupby() twice, as follows:

    Take the DataFrameGroupBy.shift() value of B under A so that for each group of A, the first entry of B will be reset and become NaN for later .fillna() to 0.

    Further grouped by A for GroupBy.cumsum() within the local sequence of A to get the desired output:

    df['C'] = (df.groupby('A')['B'].shift()
                 .groupby(df['A']).cumsum()
                 .fillna(0, downcast='infer')
              )
    

    This solution is vectorized as well as supporting non-contiguous groups too!

    Result:

    print(df)
    
    
       A  B  C
    0  1  5  0
    1  1  6  5
    2  2  3  0
    3  2  4  3
    4  2  5  7
    5  3  2  0
    5  3  7  2
    6  4  3  0
    

    Edit

    If you are to groupby more than one column and you got "KeyError", check whether your syntax is correctly entered, for example:

    If you groupby 2 columns year and income, you can use:

    df['C'] = (df.groupby(['year', 'income'])['B'].shift()
                 .groupby([df['year'], df['income']]).cumsum()
                 .fillna(0, downcast='infer')
              )
    

    Pandas supports both syntax with or without quoting df passing parameter to .groupby(). However, for any groupby() that the entity being grouped is not df itself, we may not be able to use the abbreviated form to quote just the column labels only e.g. 'year', we need to use the full column name e.g. df['year'] instead.