I'm trying to perform a cumulative sum on a dataframe that contains multiple identical names. I'd like to create another df that has a cumulative sum of the points scored per player, while also recognizing that names sometimes are not unique. The school would be the 2nd criteria. Here's an example of what I'm looking at:
df = pd.DataFrame({'Player':['John Smith', 'John Smith', 'John Smith', 'John Smith', 'John Smith'],
'School':['Duke', 'Duke', 'Duke', 'Kentucky', 'Kentucky'],
'Date':['1-1-20', '1-3-20', '1-7-20', '1-3-20', '1-08-20'],
'Points Scored':['20', '30', '15', '8', '9']})
print(df)
Player School Date Points Scored
0 John Smith Duke 1-1-20 20
1 John Smith Duke 1-3-20 30
2 John Smith Duke 1-7-20 15
3 John Smith Kentucky 1-3-20 8
4 John Smith Kentucky 1-08-20 9
I've tried using df.groupby(by=['Player', 'School', 'Date']).sum().groupby(level=[0]).cumsum()... but that doesn't seem to differentiate the second criteria. I've also tried to sort_values by School but couldn't find any luck there. The expected output would look like the below table;
Player School Date Points Scored Cumulative Sum Points Scored
0 John Smith Duke 1-1-20 20 20
1 John Smith Duke 1-3-20 30 50
2 John Smith Duke 1-7-20 15 65
3 John Smith Kentucky 1-3-20 8 8
4 John Smith Kentucky 1-08-20 9 17
Thanks in advance for the help!
import numpy as np
import pandas as pd
df = pd.DataFrame({'Player':['John Smith', 'John Smith', 'John Smith', 'John Smith', 'John Smith'],
'School':['Duke', 'Duke', 'Duke', 'Kentucky', 'Kentucky'],
'Date':['1-1-20', '1-3-20', '1-7-20', '1-3-20', '1-08-20'],
'Points Scored':[20, 30, 15, 8, 9]}) # change to integer here
df['Cumulative Sum Points Scored'] = df.groupby(['Player','School'])['Points Scored'].apply(np.cumsum)
Output:
Player School Date Points Scored Cumulative Sum Points Scored
0 John Smith Duke 1-1-20 20 20
1 John Smith Duke 1-3-20 30 50
2 John Smith Duke 1-7-20 15 65
3 John Smith Kentucky 1-3-20 8 8
4 John Smith Kentucky 1-08-20 9 17