Search code examples
pythonpandasdataframepandas-groupbycumsum

Trying to cumsum() pandas dataframe with same values appearing in multiple columns


I am trying to get a cumulative sum using groupby where the cumulative sum is applied to multiple columns that contain the same value

import pandas as pd
import numpy as np

df = pd.DataFrame([['Jazz', 'Clippers', 89, 100],
                              ['Clippers' , 'Jazz', 101, 97],
                              ['Bucks' , 'Jazz', 99, 112],
                              ['Jazz' , 'Bucks', 109, 88]],
                      columns=['home_team', 'away_team', 'home_points', 'away_points'])
print(df)

This will produce a dataframe with output of

  home_team away_team  home_points  away_points
0      Jazz  Clippers           89          100
1  Clippers      Jazz          101           97
2     Bucks      Jazz           99          112
3      Jazz     Bucks          109           88

what I'm trying to do is get the cumulative total points for the home and away team that will account for the fact that each team appears in both home and away columns but all I have been able to figure out is the cumulative total grouped by team name that totals each team as home OR away, like so

df["home_cumulative_points"]= df.groupby(["home_team"])["home_points"].cumsum() 
df["away_cumulative_points"]= df.groupby(["away_team"])["away_points"].cumsum() 
print(df)

which produces

  home_team away_team  home_points  away_points  home_cumulative_points  away_cumulative_points
0      Jazz  Clippers           89          100                      89                     100
1  Clippers      Jazz          101           97                     101                      97
2     Bucks      Jazz           99          112                      99                     209
3      Jazz     Bucks          109           88                     198                      88

Is there any way I can groupby to make the cumulative sum account for the presence of the same team in the home and away column to make the running sum add the teams points regardless of if they were home or away? So the ideal output of the last row would be

  home_team away_team  home_points  away_points  home_cumulative_points  away_cumulative_points
3      Jazz     Bucks          109           88                     407                      187

I'm guessing I may need to do a for loop or something but I'm just not sure how best to go about it. Thanks in advance for any feedback!


Solution

  • Idea is select only necessary columns, split by _ for MultiIndex, reshape by DataFrame.stack, so possible use cumsum per both columns together:

    cols = ['home_team', 'away_team', 'home_points', 'away_points']
    
    df1 = df[cols].copy()
    df1.columns = df1.columns.str.split('_', expand=True)
    df1 = df1.stack(0).rename_axis(['lev1','lev2'])
    df1["cumulative_points"]= df1.groupby(["team", 'lev1'])["points"].cumsum() 
    
    df2 = df1.unstack()
    df2.columns = df2.columns.map(lambda x: f'{x[1]}_{x[0]}')
    print(df2)
          away_points  home_points away_team home_team  away_cumulative_points  \
    lev1                                                                         
    0             100           89  Clippers      Jazz                     100   
    1              97          101      Jazz  Clippers                      97   
    2             112           99      Jazz     Bucks                     112   
    3              88          109     Bucks      Jazz                      88   
    
          home_cumulative_points  
    lev1                          
    0                         89  
    1                        101  
    2                         99  
    3                        109  
    

    Or:

    df["home_cumulative_points"]= df1.loc['home', 'cumulative_points']
    df["away_cumulative_points"]= df1.loc['away', 'cumulative_points']
    

    Another approach is use concat with rename for reshape:

    f = lambda x: x.split('_')[1]
    df1 = pd.concat([df[['home_team', 'home_points']].rename(columns=f),
                     df[['away_team', 'away_points']].rename(columns=f)], keys=('home','away'))
    df1 = df1.rename_axis(['lev1','lev2'])
    df1["cumulative_points"]= df1.groupby(["team", 'lev1'])["points"].cumsum()
    
    df["home_cumulative_points"]= df1.loc['home', 'cumulative_points']
    df["away_cumulative_points"]= df1.loc['away', 'cumulative_points']
    print(df)
      home_team away_team  home_points  away_points  home_cumulative_points  \
    0      Jazz  Clippers           89          100                      89   
    1  Clippers      Jazz          101           97                     101   
    2     Bucks      Jazz           99          112                      99   
    3      Jazz     Bucks          109           88                     198   
    
       away_cumulative_points  
    0                     100  
    1                      97  
    2                     209  
    3                      88