Search code examples
pythonpandasgroup-byrolling-computation

Pandas - assign groupby rolling mean results to new column respecting initial dataframe


My goal is to to calculate the moving average line for avg_price column each customer.

The dataframe looks like this:

    customer    avg_price   avg_price2  count1  count2  rate
date_time                       
2022-06-11 00:00:00 Customer1   4.4656  1.25    36  11084   0.003237
2022-06-11 00:00:00 Customer2   7.8873  0.92    10  22150   0.000451
2022-06-11 00:00:00 Customer3   2.3016  1.37    1   2521    0.000397
2022-06-11 00:00:00 Customer4   3.2421  1.05    221 98973   0.002228
2022-06-11 00:00:00 Customer5   1.0050  0.94    2   410     0.004854
... ... ... ... ... ... ...
2022-06-21 10:00:00 Customer1   4.9450  1.99    340 118000  0.002873
2022-06-21 10:00:00 Customer2   4.0643  2.06    268 20850   0.012691
2022-06-21 10:00:00 Customer3   3.7034  1.00    25  5100    0.004878
2022-06-21 10:00:00 Customer4   5.0367  2.64    2098    118251  0.017433
2022-06-21 10:00:00 Customer5   2.7429  1.57    50  11900   0.004184

Now, I already believe I found a way to do the calculation so I'm trying to figure out a way to assign the results of this group to the dataframe as a new column.

df.groupby('customer')['avg_price'].rolling(50).mean()

The results looks like this:

customer  date_time          
Customer1   2022-06-11 00:00:00          NaN
            2022-06-11 06:44:00          NaN
            2022-06-11 08:07:00          NaN
            2022-06-11 08:10:00          NaN
            2022-06-11 08:20:00          NaN
                                     ...    
Customer2   2022-06-21 09:56:00    14.048820
            2022-06-21 09:57:00    14.060702
            2022-06-21 09:58:00    14.054652
            2022-06-21 09:59:00    14.024310
            2022-06-21 10:00:00    14.020486
Name: avg_price, Length: 228442, dtype: float64

When I tried adding this to the dataframe as a column, I got the following error:

df['ma_24h'] = df.groupby('customer' ['avg_price'].rolling(50).mean().reset_index(level=0,drop=True)

ValueError: cannot reindex from a duplicate axis

I want to add the calculated column, the results without customer to the existing dataframe so that it respects the already existing index which the date_time column.


Solution

  • I believe I found a way thanks to this post: How to include Moving Average with Pandas based on Values on other Columns

    df['ma_24h'] = (df.groupby(['customer'])['avg_price']
                    .rolling('24H', min_periods=360)
                    .mean()
                    .sort_index(level=1)
                    .values)
    

    I changed the rolling to 24H. The main issue I had was with the index and getting the values and apparently I had to sort the index and just select the values.

    If there are other simpler ways to accomplish this, I'm open for them.