Search code examples
pythonpandastime-series

Pandas dataframe: How to create multiple columns of different lagged values with groubpy in one call?


End goal: have n lagged (named) feature values in every row of a pandas dataframe respecting group

Secondary goal: No for loops in pandas call

A naive implementation could be:

n = 7
for feature in features:
    for i in range(1, n + 1):
        df[feature + "_" + str(i)] = df.groupby(["a", "b"])[feature].shift(i)

I would like to have something like this instead, saving len(features) * n - 1 calls to the pandas api:

lagged_features = {}
n = 7
for feature in features:
    for i in range(1, n + 1):
        key = feature + "_" + str(i)
        lagged_features[key] = pd.NamedAgg(feature, lambda x: x.shift(i))

df_lagged = df.groupby(["a", "b"]).agg(**lagged_features)

However agg doesn't like shift :(

And you get a ValueError: Must produce aggregated value

I believe you should be able to use apply? But, I can only think of a way to get the number of calls down to n (much better, but it would be so much cleaner if you could do something like agg)


Solution

  • Your initial approach attempted to use pandas.DataFrame.groupby().agg(), but, .agg() method requires functions that return a single value per group, and shift() does not meet this requirement as it returns a Series instead.

    To solve this problem, you can use pandas.DataFrame.groupby().apply() as you mentioned. This function allows you to apply a function that returns a DataFrame to each group.

    Here is an example. It's a solution where you only need to call the pandas API once for each group:

    import pandas as pd
    
    n = 2
    features = ['feature1', 'feature2']
    
    def lag_features(grp):
        lags = {f'{feature}_{i}': grp[feature].shift(i) for feature in features for i in range(1, n + 1)}
        return pd.DataFrame(lags)
    
    df_lagged = df.groupby(["a", "b"]).apply(lag_features).reset_index(level=2, drop=True)
    
    

    Explanation

    • lag_features(grp) is a helper function that gets applied to each group. It takes a DataFrame (a single group) as input. Inside this function, a dictionary comprehension is used to generate the lagged features for each feature in the group. This dictionary is then transformed into DataFrame, which is returned by the function.
    • df.groupby(["a", "b"]).apply(lag_features) groups the DataFrame by a and b, then applies lag_features(grp) to each group. This gives us a DataFrame with multi-index where the 3-rd level of the index is the original DataFrame's index.
    • reset_index(level=2, drop=True) is used to remove this 3rd level of the index