Search code examples
pandasgroup-bycumsum

Pandas group by cumsum keep columns


I have spent a few hours now trying to do a "cumulative group by sum" on a pandas dataframe. I have looked at all the stackoverflow answers and surprisingly none of them can solve my (very elementary) problem:

I have a dataframe:

df1 Out[8]: Name Date Amount 0 Jack 2016-01-31 10 1 Jack 2016-02-29 5 2 Jack 2016-02-29 8 3 Jill 2016-01-31 10 4 Jill 2016-02-29 5

I am trying to

  1. group by ['Name','Date'] and
  2. cumsum 'Amount'.
  3. That is it.

So the desired output is:

df1 Out[10]: Name Date Cumsum 0 Jack 2016-01-31 10 1 Jack 2016-02-29 23 2 Jill 2016-01-31 10 3 Jill 2016-02-29 15

EDIT: I am simplifying the question. With the current answers I still can't get the correct "running" cumsum. Look closely, I want to see the cumulative sum "10, 23, 10, 15". In words, I want to see, at every consecutive date, the total cumulative sum for a person. NB: If there are two entries on one date for the same person, I want to sum those and then add them to the running cumsum and only then print the sum.


Solution

  • You need assign output to new column and then remove Amount column by drop:

    df1['Cumsum'] = df1.groupby(by=['Name','Date'])['Amount'].cumsum()
    df1 = df1.drop('Amount', axis=1)
    print (df1)
       Name        Date  Cumsum
    0  Jack  2016-01-31      10
    1  Jack  2016-02-29       5
    2  Jack  2016-02-29      13
    3  Jill  2016-01-31      10
    4  Jill  2016-02-29       5
    

    Another solution with assign:

    df1 = df1.assign(Cumsum=df1.groupby(by=['Name','Date'])['Amount'].cumsum())
             .drop('Amount', axis=1)
    print (df1)
       Name        Date  Cumsum
    0  Jack  2016-01-31      10
    1  Jack  2016-02-29       5
    2  Jack  2016-02-29      13
    3  Jill  2016-01-31      10
    4  Jill  2016-02-29       5
    

    EDIT by comment:

    First groupby columns Name and Date and aggregate sum, then groupby by level Name and aggregate cumsum.

    df = df1.groupby(by=['Name','Date'])['Amount'].sum()
            .groupby(level='Name').cumsum().reset_index(name='Cumsum')
    print (df)
       Name        Date  Cumsum
    0  Jack  2016-01-31      10
    1  Jack  2016-02-29      23
    2  Jill  2016-01-31      10
    3  Jill  2016-02-29      15