Search code examples
pandaspandas-groupbymoving-averagerolling-computation

pandas moving average by group calculation is wrong


My dataframe priceDF begins like this:

    Date        symbol  Close
0   2000-01-03  HELN.SW 28.28
1   2000-01-04  HELN.SW 27.50
2   2000-01-05  HELN.SW 26.71
3   2000-01-06  HELN.SW 27.16
4   2000-01-07  HELN.SW 27.50

To calculate the moving average of the Closes, grouped by symbol I did:

priceDF['avg'] = priceDF.groupby('symbol')'Close'].rolling(3).mean().reset_index(drop=True)

and I get:

    Date        symbol  Close   avg
0   2000-01-03  HELN.SW 28.28   NaN
1   2000-01-04  HELN.SW 27.50   NaN
2   2000-01-05  HELN.SW 26.71   12.537398
3   2000-01-06  HELN.SW 27.16   12.022164
4   2000-01-07  HELN.SW 27.50   11.922733

In row 2, I want avg = 27.50, the average of the Closes in rows 0 to 2. What do I misunderstand?


Solution

  • I am pretty sure this is an issue with how reset_index is used here,in conjuction with having multiple groups. Let us consider a slightly extended example:

    from io import StringIO
    
    data = StringIO(
    '''
       Date        symbol  Close
    0   2000-01-03  HELN.SW 28.28
    1   2000-01-04  HELN.SW 27.50
    2   2000-01-05  HELN.SW 26.71
    3   2000-01-06  HELN.SW 27.16
    4   2000-01-07  HELN.SW 27.50
    5   2000-01-07  AAAA.SW 30.00
    ''')
    df = pd.read_csv(data, sep = '\s+', index_col=0)
    

    (we have added AAAA.SW to the list)

    Now this:

    df.groupby('symbol')['Close'].rolling(3).mean()
    

    produces sensible numbers (as there is only one date for AAAA we expect NaN:

    
    symbol    
    AAAA.SW  5          NaN
    HELN.SW  0          NaN
             1          NaN
             2    27.496667
             3    27.123333
             4    27.123333
    Name: Close, dtype: float64
    

    but this:

    df.groupby('symbol')['Close'].rolling(3).mean().reset_index(drop=True)
    

    produces wrong indices

    0          NaN
    1          NaN
    2          NaN
    3    27.496667
    4    27.123333
    5    27.123333
    Name: Close, dtype: float64
    

    and when put into the original df end up in the wrong rows:

    df['avg'] = df.groupby('symbol')['Close'].rolling(3).mean().reset_index(drop=True)
    df
    

    produces

         Date       symbol  Close   avg
    0   2000-01-03  HELN.SW 28.28   NaN
    1   2000-01-04  HELN.SW 27.50   NaN
    2   2000-01-05  HELN.SW 26.71   NaN
    3   2000-01-06  HELN.SW 27.16   27.496667
    4   2000-01-07  HELN.SW 27.50   27.123333
    5   2000-01-07  AAAA.SW 30.00   27.123333
    

    a solution is to do the aqssignments within each group, like this:

    df.groupby('symbol').apply(lambda g: g.assign(avg = g['Close'].rolling(3).mean())).reset_index(drop=True)
    

    so we get

    
        Date        symbol  Close   avg
    0   2000-01-07  AAAA.SW 30.00   NaN
    1   2000-01-03  HELN.SW 28.28   NaN
    2   2000-01-04  HELN.SW 27.50   NaN
    3   2000-01-05  HELN.SW 26.71   27.496667
    4   2000-01-06  HELN.SW 27.16   27.123333
    5   2000-01-07  HELN.SW 27.50   27.123333