Search code examples
pythonpandasgroup-bycumsum

Groupby/cumsum for dataframe with duplicate names


I'm trying to perform a cumulative sum on a dataframe that contains multiple identical names. I'd like to create another df that has a cumulative sum of the points scored per player, while also recognizing that names sometimes are not unique. The school would be the 2nd criteria. Here's an example of what I'm looking at:

df = pd.DataFrame({'Player':['John Smith', 'John Smith', 'John Smith', 'John Smith', 'John Smith'],
           'School':['Duke', 'Duke', 'Duke', 'Kentucky', 'Kentucky'],
           'Date':['1-1-20', '1-3-20', '1-7-20', '1-3-20', '1-08-20'],
           'Points Scored':['20', '30', '15', '8', '9']})

print(df)

     Player       School     Date    Points Scored
0  John Smith      Duke   1-1-20            20
1  John Smith      Duke   1-3-20            30
2  John Smith      Duke   1-7-20            15
3  John Smith  Kentucky   1-3-20             8
4  John Smith  Kentucky  1-08-20             9

I've tried using df.groupby(by=['Player', 'School', 'Date']).sum().groupby(level=[0]).cumsum()... but that doesn't seem to differentiate the second criteria. I've also tried to sort_values by School but couldn't find any luck there. The expected output would look like the below table;

  Player        School              Date     Points Scored  Cumulative Sum Points Scored
0  John Smith   Duke                  1-1-20          20              20                   
1  John Smith   Duke                  1-3-20          30              50
2  John Smith   Duke                  1-7-20          15              65
3  John Smith   Kentucky              1-3-20           8              8
4  John Smith   Kentucky              1-08-20          9              17

Thanks in advance for the help!


Solution

  • import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({'Player':['John Smith', 'John Smith', 'John Smith', 'John     Smith', 'John Smith'],
           'School':['Duke', 'Duke', 'Duke', 'Kentucky', 'Kentucky'],
           'Date':['1-1-20', '1-3-20', '1-7-20', '1-3-20', '1-08-20'],
           'Points Scored':[20, 30, 15, 8, 9]}) # change to integer here
    
    df['Cumulative Sum Points Scored'] = df.groupby(['Player','School'])['Points Scored'].apply(np.cumsum)
    

    Output:

       Player         School  Date         Points Scored      Cumulative Sum Points Scored
    0  John Smith      Duke   1-1-20             20                            20
    1  John Smith      Duke   1-3-20             30                            50
    2  John Smith      Duke   1-7-20             15                            65
    3  John Smith  Kentucky   1-3-20              8                             8
    4  John Smith  Kentucky  1-08-20              9                            17