Search code examples
pandasgroup-byrolling-computation

Pandas compute rollling average after groupie


I am trying to compute the rolling means of values after grouping by. My dataset looks like

import pandas as pd
df = pd.DataFrame({'day': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02', '2020-01-02', '2020-01-02', '2020-01-03', '2020-01-03', '2020-01-03','2020-01-03'], 
               'weather': ['rain', 'sun', 'rain', 'sun', 'rain', 'sun', 'rain', 'sun', 'rain', 'sun', 'rain', 'sun'], 
               'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})

enter image description here

Now, I want to have a rolling mean per day, and per weather.

While

>>>> df.groupby(['day', 'weather']).value.mean()
day         weather
2020-01-01  rain        2
            sun         3
2020-01-02  rain        6
            sun         7
2020-01-03  rain       10
            sun        11

proper computes the mean, the rolling version of it, does not seem to work?

>>>> df.groupby(['day', 'weather']).value.rolling(2).mean()
day         weather    
2020-01-01  rain     0      NaN
                     2      2.0
            sun      1      NaN
                     3      3.0
2020-01-02  rain     4      NaN
                     6      6.0
            sun      5      NaN
                     7      7.0
2020-01-03  rain     8      NaN
                     10    10.0
            sun      9      NaN
                     11    11.0

What's the right way of doing it?

I would expect an output that is the mean over multiple days i.e. (ignore the index)

day         weather    
2020-01-01  rain     2      2.0
            sun      3      3.0
2020-01-02  rain     6      4.0
            sun      5      5.0
2020-01-03  rain     8      8.0
            sun      9      9.0

Solution

  • I think you are referring to the NaN values? Your window is set to 2 so the first value of each group will be set to NaN because of min_periods. Here is a quote of the documentation :

    For a window that is specified by an integer, min_periods will default to the size of the window.

    df.groupby(['day', 'weather']).value.rolling(2,min_periods=1).mean()
    
    day         weather    
    2020-01-01  rain     0      1.0
                         2      2.0
                sun      1      2.0
                         3      3.0
    2020-01-02  rain     4      5.0
                         6      6.0
                sun      5      6.0
                         7      7.0
    2020-01-03  rain     8      9.0
                         10    10.0
                sun      9     10.0
                         11    11.0
    

    Was that what you are searching for?

    UPDATE

    You want a mean value for each day and wheather and than a rolling mean over x days of the computed mean (if I understand that right). Try this:

    out = df.groupby(['day','weather',],as_index=False)['value'].mean()
    print(out)
    
              day weather  value
    0  2020-01-01    rain    2.0
    1  2020-01-01     sun    3.0
    2  2020-01-02    rain    6.0
    3  2020-01-02     sun    7.0
    4  2020-01-03    rain   10.0
    5  2020-01-03     sun   11.0
    

    With the daily mean you can compute your rolling window:

    out['rolling_mean'] = out.groupby('weather', as_index=False)['value'].rolling(2,min_periods=1).mean()['value']
    
    print(out)
    
              day weather  value  rolling_mean
    0  2020-01-01    rain    2.0           2.0
    1  2020-01-01     sun    3.0           3.0
    2  2020-01-02    rain    6.0           4.0
    3  2020-01-02     sun    7.0           5.0
    4  2020-01-03    rain   10.0           8.0
    5  2020-01-03     sun   11.0           9.0