python pandas performance group-by pandas-groupby

Is there a more efficient way of looping a group-by function in Python?


I am trying to efficiently calculate different features such as 'Last Game', 'Season Average' and so on for a long list of different statistics, which I have placed in a list named GameStatistics. Because there are around 100 statistics and 100 calculations (I have shown one of the features, 'Last Game', below as an example), looping over them one at a time has become infeasibly slow.

This is my current code, where the new column name specifying the statistic is essential:

for Statistic in GameStatistics:
    df[f'{Statistic} - Last Game'] = df.groupby('Name')[Statistic].shift()

Is there any quicker method to calculate features for all the statistics, possibly at the same time, thereby needing to perform each pandas group-by function only once?


Solution

  • I don't know how many game statistics you have, but in my testing your code ran fine even with 1000 statistics.

    That said, there is room for improvement. With pandas, prefer vectorized methods over Python loops. Here you can select the whole list of columns in a single group-by call, so the grouping is done once instead of once per statistic:

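    # Shift every statistic column within each player's group in one call,
    # rename the shifted columns, and append them to the original frame.
    result = pd.concat(
        [df, df.groupby("Name")[GameStatistics].shift().add_suffix(" - Last Game")],
        axis=1
    )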
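
For anyone who wants to try this out, here is a minimal self-contained sketch of the same idea. The toy data and the column names `Points` and `Assists` are made up for illustration. The 'Season Average' definition is only an assumption (the mean of the same player's earlier games, with rows already in chronological order and a single season), since the question does not define it. It reuses one GroupBy object for several features:

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Bob", "Ann"],
    "Points": [10, 7, 12, 9, 8],
    "Assists": [3, 5, 4, 2, 6],
})
GameStatistics = ["Points", "Assists"]

# Build the GroupBy object once and reuse it for every feature.
grouped = df.groupby("Name")[GameStatistics]

# Previous game's value per player (NaN for each player's first game).
last_game = grouped.shift().add_suffix(" - Last Game")

# Assumed definition of 'Season Average': mean of the player's earlier
# games, excluding the current one (NaN when there are no earlier games).
prior_sum = grouped.cumsum() - df[GameStatistics]
prior_count = df.groupby("Name").cumcount()
season_avg = (
    prior_sum.div(prior_count.where(prior_count > 0), axis=0)
    .add_suffix(" - Season Average")
)

# Append all new feature columns in a single concat.
result = pd.concat([df, last_game, season_avg], axis=1)
print(result)
```

Building all feature frames first and concatenating once also avoids inserting hundreds of columns one by one, which pandas may flag with a fragmentation `PerformanceWarning` on wide frames.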