Search code examples
pythonpandaspandas-groupbymulti-index

Resampling with pandas.MultiIndex: Resampler.aggregate() & Resampler[column]


I am trying to resample a data frame. First, I want to keep several aggregations in the result. Second, there is an additional aggregation in interest for a specific column. Since this aggregation is only relevant to a single column, the resampler can be limited to this column in order not to unnecessarily apply the aggregation to the other columns.

This scenario is working for a simple one-dimensional column index:

import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.rand(50,4), index=pd.to_datetime(np.arange(0, 50), unit="s"), columns=["a", "b", "c", "d"])
r = df.resample("10s")
result = r.aggregate(["mean", "std"])
result[("d", "ffill")] = r["d"].ffill()
print(result)

However, as soon as I start to use multi-indexed columns problems appear. First, I can not keep several aggregations at once:

df.columns = pd.MultiIndex.from_product([("a", "b"), ("alpha", "beta")])
r = df.resample("10s")    # can be omitted
result = r.aggregate(["mean", "std"])
---> AttributeError: 'Series' object has no attribute 'columns'

Second, the resampler can not be limited to the relevant column anymore:

r[("b", "beta")].ffill()
--> KeyError: "Columns not found: 'b', 'beta'"

How can I transform my concern from simple indices to multi-indices?


Solution

  • you can use pd.Grouper in a groupby instead of resample, such as:

    result = df.groupby(pd.Grouper(freq='10s',level=0)).aggregate(["mean", "std"])
    print (result)
                               a                                       b  \
                            alpha                beta               alpha   
                             mean       std      mean       std      mean   
    1970-01-01 00:00:00  0.460569  0.312508  0.476511  0.260534  0.479577   
    1970-01-01 00:00:10  0.441498  0.315277  0.487855  0.306068  0.535842   
    1970-01-01 00:00:20  0.569884  0.248503  0.320552  0.288479  0.507755   
    1970-01-01 00:00:30  0.478037  0.262654  0.552214  0.251581  0.505132   
    1970-01-01 00:00:40  0.611227  0.328916  0.473773  0.241604  0.358298   
    
    
                                       beta            
                              std      mean       std  
    1970-01-01 00:00:00  0.357493  0.448487  0.294432  
    1970-01-01 00:00:10  0.259145  0.472250  0.320954  
    1970-01-01 00:00:20  0.369490  0.432944  0.150473  
    1970-01-01 00:00:30  0.298759  0.381614  0.248785  
    1970-01-01 00:00:40  0.203831  0.381412  0.374965  
    

    and for the second part, I'm not sure what you mean, but according to the result given in the case of the single column level, try this it gives a result

    result[("b", "beta",'ffill')] = df.groupby(pd.Grouper(freq='10s',level=0))[[("b", "beta")]].first()