Pandas resampling with category variable


I would like to resample a dataframe hourly and keep the category variable. How can I do that efficiently? I normally use df = df.resample('h').sum(), but this does not work with my category variable. Any ideas?

import pandas as pd

date  = ['2015-02-03 23:00:00','2015-02-03 23:30:00','2015-02-04 00:00:00','2015-02-04 00:30:00']
value = [33.24, 31.71, 34.39, 34.49]
value2 = [2*x for x in value]
value3 = [3*x for x in value]
cat = ['a','a','b','b']
df = pd.DataFrame({'value':value,'value2':value2,'value3':value3,'index':date,'category':cat})

# Use the date strings as a DatetimeIndex; the format includes seconds to match the data
df.index = pd.to_datetime(df['index'],format='%Y-%m-%d %H:%M:%S')
df.drop(['index'],axis=1,inplace=True)

print(df.head())
                     value  value2  value3 category
index                                              
2015-02-03 23:00:00  33.24   66.48   99.72        a
2015-02-03 23:30:00  31.71   63.42   95.13        a
2015-02-04 00:00:00  34.39   68.78  103.17        b
2015-02-04 00:30:00  34.49   68.98  103.47        b

Expected result:

                     value  value2  value3 category
index                                              
2015-02-03 23:00:00  64.95  129.90  194.85        a
2015-02-04 00:00:00  68.88  137.76  206.64        b

Solution

  • Use DataFrameGroupBy.resample, i.e. chain groupby with resample. The category then becomes the first level of the resulting MultiIndex:

    df = df.groupby('category').resample('h').sum()
    print (df)
                                  value  value2  value3
    category index                                     
    a        2015-02-03 23:00:00  64.95  129.90  194.85
    b        2015-02-04 00:00:00  68.88  137.76  206.64
    

    Alternatively, pass pd.Grouper to groupby together with the category column:

    df = df.groupby(['category', pd.Grouper(freq='h')]).sum()
    print (df)
                                  value  value2  value3
    category index                                     
    a        2015-02-03 23:00:00  64.95  129.90  194.85
    b        2015-02-04 00:00:00  68.88  137.76  206.64
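
Both results above carry the category in the index, whereas the expected result in the question shows it as an ordinary column. A minimal sketch of two ways to get that shape, rebuilding the question's sample data so it runs on its own. The names num_cols, by_group and by_first are only illustrative. The second variant is an assumption-laden shortcut: it is only meaningful if each hourly bin contains a single category, which holds for this sample but may not hold for real data.

```python
import pandas as pd

# Sample data from the question, with a DatetimeIndex named "index".
value = [33.24, 31.71, 34.39, 34.49]
df = pd.DataFrame(
    {
        "value": value,
        "value2": [2 * x for x in value],
        "value3": [3 * x for x in value],
        "category": ["a", "a", "b", "b"],
    },
    index=pd.to_datetime(
        [
            "2015-02-03 23:00:00",
            "2015-02-03 23:30:00",
            "2015-02-04 00:00:00",
            "2015-02-04 00:30:00",
        ]
    ),
)
df.index.name = "index"

num_cols = ["value", "value2", "value3"]  # illustrative helper name

# Variant 1: group by category and hour, then move the category
# index level back into a column and restore the column order.
by_group = (
    df.groupby(["category", pd.Grouper(freq="h")])
    .sum()
    .reset_index(level="category")
)
by_group = by_group[num_cols + ["category"]]

# Variant 2: plain resample with a per-column aggregation.
# Assumes one category per hourly bin; "first" just picks the first label.
agg_spec = {col: "sum" for col in num_cols}
agg_spec["category"] = "first"
by_first = df.resample("h").agg(agg_spec)

print(by_group)
print(by_first)
```

If an hour can contain rows from several categories, the first variant is the safer choice, since it keeps one output row per category and hour instead of silently merging them.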