I am trying to get a cumulative sum using groupby where the cumulative sum is applied to multiple columns that contain the same value
import pandas as pd
import numpy as np
df = pd.DataFrame([['Jazz', 'Clippers', 89, 100],
['Clippers' , 'Jazz', 101, 97],
['Bucks' , 'Jazz', 99, 112],
['Jazz' , 'Bucks', 109, 88]],
columns=['home_team', 'away_team', 'home_points', 'away_points'])
print(df)
This will produce a dataframe with output of
home_team away_team home_points away_points
0 Jazz Clippers 89 100
1 Clippers Jazz 101 97
2 Bucks Jazz 99 112
3 Jazz Bucks 109 88
what I'm trying to do is get the cumulative total points for the home and away team that will account for the fact that each team appears in both home and away columns but all I have been able to figure out is the cumulative total grouped by team name that totals each team as home OR away, like so
df["home_cumulative_points"]= df.groupby(["home_team"])["home_points"].cumsum()
df["away_cumulative_points"]= df.groupby(["away_team"])["away_points"].cumsum()
print(df)
which produces
home_team away_team home_points away_points home_cumulative_points away_cumulative_points
0 Jazz Clippers 89 100 89 100
1 Clippers Jazz 101 97 101 97
2 Bucks Jazz 99 112 99 209
3 Jazz Bucks 109 88 198 88
Is there any way I can groupby to make the cumulative sum account for the presence of the same team in the home and away column to make the running sum add the teams points regardless of if they were home or away? So the ideal output of the last row would be
home_team away_team home_points away_points home_cumulative_points away_cumulative_points
3 Jazz Bucks 109 88 407 187
I'm guessing I may need to do a for loop or something but I'm just not sure how best to go about it. Thanks in advance for any feedback!
Idea is select only necessary columns, split by _
for MultiIndex
, reshape by DataFrame.stack
, so possible use cumsum
per both columns together:
cols = ['home_team', 'away_team', 'home_points', 'away_points']
df1 = df[cols].copy()
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack(0).rename_axis(['lev1','lev2'])
df1["cumulative_points"]= df1.groupby(["team", 'lev1'])["points"].cumsum()
df2 = df1.unstack()
df2.columns = df2.columns.map(lambda x: f'{x[1]}_{x[0]}')
print(df2)
away_points home_points away_team home_team away_cumulative_points \
lev1
0 100 89 Clippers Jazz 100
1 97 101 Jazz Clippers 97
2 112 99 Jazz Bucks 112
3 88 109 Bucks Jazz 88
home_cumulative_points
lev1
0 89
1 101
2 99
3 109
Or:
df["home_cumulative_points"]= df1.loc['home', 'cumulative_points']
df["away_cumulative_points"]= df1.loc['away', 'cumulative_points']
Another approach is use concat
with rename
for reshape:
f = lambda x: x.split('_')[1]
df1 = pd.concat([df[['home_team', 'home_points']].rename(columns=f),
df[['away_team', 'away_points']].rename(columns=f)], keys=('home','away'))
df1 = df1.rename_axis(['lev1','lev2'])
df1["cumulative_points"]= df1.groupby(["team", 'lev1'])["points"].cumsum()
df["home_cumulative_points"]= df1.loc['home', 'cumulative_points']
df["away_cumulative_points"]= df1.loc['away', 'cumulative_points']
print(df)
home_team away_team home_points away_points home_cumulative_points \
0 Jazz Clippers 89 100 89
1 Clippers Jazz 101 97 101
2 Bucks Jazz 99 112 99
3 Jazz Bucks 109 88 198
away_cumulative_points
0 100
1 97
2 209
3 88