Search code examples
pythonpandasnumpyrolling-average

Find running average which equal


Let say I have some data associated with football

Date   Home     Away  HomeGoal AwayGoal TotalGoal
2019   Arsenal  MU     5        1        6
2019   MCity    Liv    2        2        4
2019   MU       Liv    3        4        7
2019   MCity    MU     0        0        0

I want to create a column of data which show the average goals of that team in the recent 2 games. For example in the last row, I want to include a column that shows the average goal of MU in their last 2 games, which = (1+3)/2 = 2.

Is there any functions in python can achieve this?


Solution

  • For your requirement, you don't care if a team is Home or Away, only how many goal it scored per match. Try this:

    # Rename the columns to make the unstacking operation a bit easier
    # Always a good idea to specify an explicit `copy` when you intend
    # to change the dataframe structure
    >>> tmp = df[['Home', 'Away', 'HomeGoal', 'AwayGoal']].copy()
    
    # Arrange the columns into a MultiIndex to make stacking easier
    >>> tmp.columns = pd.MultiIndex.from_product([['Team', 'Goal'], ['Home', 'Away']])
    
    # This is what `tmp` look like:
    
               Team      Goal     
          Home Away Home Away
    0  Arsenal   MU    5    1
    1    MCity  Liv    2    2
    2       MU  Liv    3    4
    3    MCity   MU    0    0
    
    # And now the magic
    >>> tmp.stack() \
            .groupby('Team').rolling(2).mean() \
            .groupby('Team').tail(1) \
            .droplevel([1,2])
    
    # Result
             Goal
    Team         
    Arsenal   NaN
    Liv       3.0
    MCity     1.0
    MU        1.5
    

    Here's how it works:

    • stack unpivots Home and Away so that for every match, we have 2 rows for Teams and Goal
    • groupby('Team').rolling(2).mean() gets the rolling average of goal scored over the last 2 games per team
    • groupby('Team').tail(1) gets the last of those rolling averages per team
    • By this time, the transitional dataframe has 3 levels in its index: team's name, match number and home/away indicator of the last game played. We only care about the first, so we will drop the other 2.