Search code examples
pythonpandaspandas-groupbymedianrolling-computation

Calculating moving median within group


I want to perform rolling median on price column over 4 days back, data will be groupped by date. So basically I want to take prices for a given day and all prices for 4 days back and calculate median out of these values.

Here are the sample data:

id      date        price
1637027 2020-01-21  7045204.0
280955  2020-01-11  3590000.0
782078  2020-01-28  2600000.0
1921717 2020-02-17  5500000.0
1280579 2020-01-23  869000.0
2113506 2020-01-23  628869.0
580638  2020-01-25  650000.0
1843598 2020-02-29  969000.0
2300960 2020-01-24  5401530.0
1921380 2020-02-19  1220000.0
853202  2020-02-02  2990000.0
1024595 2020-01-27  3300000.0
565202  2020-01-25  3540000.0
703824  2020-01-18  3990000.0
426016  2020-01-26  830000.0

I got close with combining rolling and groupby:

df.groupby('date').rolling(window = 4, on = 'date')['price'].median()

But this seems to add one row per each index value and by median definition, I am not able to somehow merge these rows to produce one result per row.

Result now looks like this:

date        date      
2020-01-10  2020-01-10          NaN
            2020-01-10          NaN
            2020-01-10          NaN
            2020-01-10    3070000.0
            2020-01-10    4890000.0
                            ...    
2020-03-11  2020-03-11    4290000.0
            2020-03-11    3745000.0
            2020-03-11    3149500.0
            2020-03-11    3149500.0
            2020-03-11    3149500.0
Name: price, Length: 389716, dtype: float64

It seems it just deleted 3 first values and then just printed price value.

Is it possible to get one lagged / moving median value per one date?


Solution

  • You can use rolling with a frequency window of 5 days to get today and last 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original one), sort_values per date and ensure the date column is datetime

    #sort and change to datetime
    df_f = df[['date','price']].copy().sort_values('date')
    df_f['date'] = pd.to_datetime(df_f['date'])
    
    #create the column rolling
    df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
    
    #drop_duplicates and keep the last row per day
    df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
    
    print (df_f)
    
             date      price
    0  2020-01-11  3590000.0
    1  2020-01-18  3990000.0
    2  2020-01-21  5517602.0
    3  2020-01-23   869000.0
    4  2020-01-24  3135265.0
    5  2020-01-25  2204500.0
    6  2020-01-26   849500.0
    7  2020-01-27   869000.0
    8  2020-01-28  2950000.0
    9  2020-02-02  2990000.0
    10 2020-02-17  5500000.0
    11 2020-02-19  3360000.0
    12 2020-02-29   969000.0