End goal: every row of a pandas DataFrame carries n lagged (named) feature values, with the lags computed within each group
Secondary goal: No for loops in pandas call
A naive implementation could be:
n = 7
for feature in features:
    for i in range(1, n + 1):
        df[feature + "_" + str(i)] = df.groupby(["a", "b"])[feature].shift(i)
I would like to have something like this instead, saving len(features) * n - 1 calls to the pandas api:
lagged_features = {}
n = 7
for feature in features:
    for i in range(1, n + 1):
        key = feature + "_" + str(i)
        lagged_features[key] = pd.NamedAgg(feature, lambda x: x.shift(i))
df_lagged = df.groupby(["a", "b"]).agg(**lagged_features)
However agg doesn't like shift :(
And you get a ValueError: Must produce aggregated value
I believe you should be able to use apply? But, I can only think of a way to get the number of calls down to n (much better, but it would be so much cleaner if you could do something like agg)
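For reference, the "n calls" variant mentioned above might look roughly like the sketch below. The toy DataFrame and its values are hypothetical and only there to make the snippet runnable; the column names a, b, feature1 and feature2 are assumptions that mirror the rest of this post.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the ones used in this post.
df = pd.DataFrame({
    "a": ["x", "x", "x", "y", "y"],
    "b": [1, 1, 1, 2, 2],
    "feature1": [10, 11, 12, 20, 21],
    "feature2": [1.0, 2.0, 3.0, 4.0, 5.0],
})
features = ["feature1", "feature2"]
n = 2

# Build the grouped selection once and reuse it for every lag.
grouped = df.groupby(["a", "b"])[features]

# One grouped shift per lag, covering all features at once (n calls in total).
lagged = pd.concat(
    [grouped.shift(i).add_suffix(f"_{i}") for i in range(1, n + 1)],
    axis=1,
)

# The shifted frames keep the original row index, so a plain join lines them up.
df_lagged = df.join(lagged)
```

Each grouped shift returns a frame aligned to the original rows, so no index juggling is needed before the join.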
Your initial approach attempted to use pandas.DataFrame.groupby().agg(), but the .agg() method requires functions that return a single value per group, and shift() does not meet this requirement because it returns a Series instead.
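To make the difference concrete, here is a small sketch comparing a reducer with shift on a grouped column. The toy data and the names toy, per_group and per_row are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one grouping column "a" and one feature "x".
toy = pd.DataFrame({"a": ["u", "u", "u", "v", "v"], "x": [1, 2, 3, 10, 20]})

# A reducer such as sum returns one value per group, which is what agg expects.
per_group = toy.groupby("a")["x"].agg("sum")

# shift is a transformation: it returns one value per input row instead.
per_row = toy.groupby("a")["x"].shift(1)

print(len(per_group))  # one entry per group
print(len(per_row))    # one entry per row of toy
```

The first result has as many entries as there are groups, while the second has as many entries as the input has rows, which is why shift cannot be used as an aggregation.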
To solve this problem, you can use pandas.DataFrame.groupby().apply() as you mentioned. This function allows you to apply a function that returns a DataFrame to each group.
Here is an example. It makes a single groupby().apply() call, and the helper function runs once for each group:
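import pandas as pd

# df is assumed to already exist, with columns a, b, feature1 and feature2.
n = 2
features = ['feature1', 'feature2']

def lag_features(grp):
    # Build every lagged column for this group in one dictionary.
    lags = {f'{feature}_{i}': grp[feature].shift(i) for feature in features for i in range(1, n + 1)}
    return pd.DataFrame(lags)

# Drop the third index level (the original row index) from the result.
df_lagged = df.groupby(["a", "b"]).apply(lag_features).reset_index(level=2, drop=True)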
Explanation
lag_features(grp) is a helper function that gets applied to each group. It takes a DataFrame (a single group) as input. Inside this function, a dictionary comprehension builds one shifted Series per feature and lag. This dictionary is then converted into a DataFrame, which is returned by the function.

df.groupby(["a", "b"]).apply(lag_features) groups the DataFrame by a and b, then applies lag_features(grp) to each group. This gives us a DataFrame with a MultiIndex whose third level is the original DataFrame's index.

reset_index(level=2, drop=True) is used to remove this third level of the index.
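
If the goal is to get the lagged columns back onto every row of df, it may be more convenient to keep the original row index rather than dropping it. The sketch below is a variation on the approach above, not part of it: it assumes group_keys=False so that the group keys are not added to the index, selects the feature columns before apply, and joins the result back on the row index. The toy DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the ones used in this post.
df = pd.DataFrame({
    "a": ["x", "x", "x", "y", "y"],
    "b": [1, 1, 1, 2, 2],
    "feature1": [10, 11, 12, 20, 21],
    "feature2": [1.0, 2.0, 3.0, 4.0, 5.0],
})
features = ["feature1", "feature2"]
n = 2

def lag_features(grp):
    # One shifted Series per (feature, lag) pair, computed inside the group.
    lags = {
        f"{feature}_{i}": grp[feature].shift(i)
        for feature in features
        for i in range(1, n + 1)
    }
    return pd.DataFrame(lags)

# group_keys=False keeps the original row index instead of prepending a and b.
lagged = df.groupby(["a", "b"], group_keys=False)[features].apply(lag_features)

# Joining on the row index attaches the lagged columns to every original row.
df_lagged = df.join(lagged)
```

Because the join aligns on the row index, this assumes the index of df is unique.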