Search code examples
pythondataframegroup-bymedian

Python Dataframe Rolling median with group by


I have a dataframe with three columns, viz., date,commodity and values. I want to add another column, median_20, the rolling median of last 20 days for each commodity in the df. Also, I want to add other columns which show value n days before, for example, lag_1 column shows value 1 day before for a given commodity, lag_2 shows value 2 days before, and so on. My df is quite big (>2 million rows) in size.

dates = pd.date_range('2017-01-01', '2017-07-02')
df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
df = pd.concat([df1, df2])
df = df.sort('date')

          date  commodity     value
0   2017-01-01       GOLD -1.239422
0   2017-01-01     SILVER -0.209840
1   2017-01-02     SILVER  0.146293
1   2017-01-02       GOLD  1.422454
2   2017-01-03       GOLD  0.453222
...

Solution

  • Try:

    import pandas as pd
    import numpy as np
    
    # create dataframe
    dates = pd.date_range('2017-01-01', '2017-07-02')
    df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
    df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
    df = pd.concat([df1, df2])
    df = df.sort_values(by='date').reset_index(drop=True)
    
    # create columns
    df['median_20_temp'] = df.groupby('market')['commodity'].rolling(20).median()
    df['median_20'] = df.groupby('market')['median_20_temp'].shift(1)
    df['lag_1'] = df.groupby('market')['commodity'].shift(1)
    df['lag_2'] = df.groupby('market')['commodity'].shift(2)
    df.drop(['median_20_temp'], axis=1, inplace=True)
    

    Edit:

    The following should work with version 0.16.2:

    import numpy as np
    import pandas as pd
    
    
    np.random.seed(123)
    dates = pd.date_range('2017-01-01', '2017-07-02')
    df1 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'GOLD'})
    df2 = pd.DataFrame({'date':dates, 'commodity':np.random.normal(size = len(dates)), 'market':'SILVER'})
    df = pd.concat([df1, df2])
    df = df.sort('date').reset_index(drop=True)
    
    # create columns
    df['median_20_temp'] = df.groupby('market')['commodity'].apply(lambda s: pd.rolling_median(s, 20))
    df['median_20'] = df.groupby('market')['median_20_temp'].shift(1)
    df['lag_1'] = df.groupby('market')['commodity'].shift(1)
    df['lag_2'] = df.groupby('market')['commodity'].shift(2)
    df.drop(['median_20_temp'], axis=1, inplace=True)
    

    I hope this helps.