Search code examples
pythonpandasaveragemean

Python Pandas - Group by rolling average over multiple columns


Let me try and explain what I need help with. Im not the best so please bear with :)

I have a tennis dataset which has player 1 and player 2, sometimes a player is player 1 and sometimes they are player 2.

I have a two stats which I am trying to find the rolling average for both of them.

This is currently what I have

P1 P2 P1_1stWon P2_1stWon
Nadal Murray 5 6
Murray Nadal 7 4

and this is what im after

P1 P2 P1_1stWon P2_1stWon P1_1stWonAvg P2_1stWonAvg
Nadal Murray 5 6 5 6
Murray Nadal 7 4 6.5 4.5

So you can see that for example Nadal is P1 in the first game and scores 5 in the second game he is P2 and scores 4, therefore his average is 4.5 and as he is P2 in the second game you can see his avg score in the P2_1stWonAvg.

This is current the code I have but its not working. Any help is much appreciated :)

df['P1_1stWonAvg'] = df.groupby(['P1', 'P2'])['P1_1stWon', 'P2_1stWon'].apply(pd.rolling, 2, min_periods=2)

Solution

  • One way of approaching this is to create an intermediate dataset on the match-player level, compute rolling averages, then merge back in to the original. Here's how that might look:

    df = pd.DataFrame({
        "P1": ["Nadal", "Murray"],
        "P2": ["Murray", "Nadal"],
        "P1_1stWon": [5, 7],
        "P2_1stWon": [6, 4],
    })
    
    # Set id
    df["match_id"] = range(df.shape[0])
    
    # Create long dataset
    df_list = []
    for i in range(1, 3):
        tmp = df[["match_id", f"P{i}", f"P{i}_1stWon"]].copy()
        tmp.columns = ["match_id", "player", "score"]
        df_list.append(tmp)
    df_long = pd.concat(df_list)
    
    # Get rolling means
    rolling_scores = df_long.sort_values("match_id").groupby("player")["score"].rolling(2, min_periods=1).mean().reset_index()
    
    # Merge back to original data
    for i in range(1, 3):
        tmp = rolling_scores.copy()
        tmp.columns = [f"P{i}", "match_id", f"P{i}_1stWonAvg"]
        df = df.merge(tmp, on=["match_id", f"P{i}"], how="left")
    print(df)
    

    Which outputs:

           P1      P2  P1_1stWon  P2_1stWon  match_id  P1_1stWonAvg  P2_1stWonAvg
    0   Nadal  Murray          5          6         0           5.0           6.0
    1  Murray   Nadal          7          4         1           6.5           4.5