I am trying to resample a data frame. First, I want to keep several aggregations in the result. Second, there is an additional aggregation in interest for a specific column. Since this aggregation is only relevant to a single column, the resampler can be limited to this column in order not to unnecessarily apply the aggregation to the other columns.
This scenario is working for a simple one-dimensional column index:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.rand(50,4), index=pd.to_datetime(np.arange(0, 50), unit="s"), columns=["a", "b", "c", "d"])
r = df.resample("10s")
result = r.aggregate(["mean", "std"])
result[("d", "ffill")] = r["d"].ffill()
print(result)
However, as soon as I start to use multi-indexed columns problems appear. First, I can not keep several aggregations at once:
df.columns = pd.MultiIndex.from_product([("a", "b"), ("alpha", "beta")])
r = df.resample("10s") # can be omitted
result = r.aggregate(["mean", "std"])
---> AttributeError: 'Series' object has no attribute 'columns'
Second, the resampler can not be limited to the relevant column anymore:
r[("b", "beta")].ffill()
--> KeyError: "Columns not found: 'b', 'beta'"
How can I transform my concern from simple indices to multi-indices?
you can use pd.Grouper
in a groupby
instead of resample, such as:
result = df.groupby(pd.Grouper(freq='10s',level=0)).aggregate(["mean", "std"])
print (result)
a b \
alpha beta alpha
mean std mean std mean
1970-01-01 00:00:00 0.460569 0.312508 0.476511 0.260534 0.479577
1970-01-01 00:00:10 0.441498 0.315277 0.487855 0.306068 0.535842
1970-01-01 00:00:20 0.569884 0.248503 0.320552 0.288479 0.507755
1970-01-01 00:00:30 0.478037 0.262654 0.552214 0.251581 0.505132
1970-01-01 00:00:40 0.611227 0.328916 0.473773 0.241604 0.358298
beta
std mean std
1970-01-01 00:00:00 0.357493 0.448487 0.294432
1970-01-01 00:00:10 0.259145 0.472250 0.320954
1970-01-01 00:00:20 0.369490 0.432944 0.150473
1970-01-01 00:00:30 0.298759 0.381614 0.248785
1970-01-01 00:00:40 0.203831 0.381412 0.374965
and for the second part, I'm not sure what you mean, but according to the result given in the case of the single column level, try this it gives a result
result[("b", "beta",'ffill')] = df.groupby(pd.Grouper(freq='10s',level=0))[[("b", "beta")]].first()