Search code examples
pythonpandasdataframegroup-byrolling-computation

Pandas rolling sum within a group


I am trying to calculate a rolling sum or any other statistic (e.g. mean), within each group. Below I am giving an example where the window is 2 and the statistic is sum.

df = pd.DataFrame.from_dict({'class': ['a', 'b', 'b', 'c', 'c', 'c', 'b', 'a', 'b'],
                        'val': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df['sum2_per_class'] = [1, 2, 5, 4, 9, 11, 10, 9, 16] # I want to compute this column
# df['sum2_per_class'] = df[['class', 'val']].groupby('class').rolling(2).sum() # what I tried

    class  val  sum2_per_class
 0     a    1               1
 1     b    2               2
 2     b    3               5
 3     c    4               4
 4     c    5               9
 5     c    6              11
 6     b    7              10
 7     a    8               9
 8     b    9              16

Here's what I tried and the corresponding error:

df['sum2_per_class'] = df[['class', 'val']].groupby('class').rolling(2).sum()

TypeError: incompatible index of inserted column with frame index

Solution

  • As the error message conveys, the rolling sum operation returns a pandas Series with a MultiIndex, which can't be directly assigned to a single column in a dataframe.

    A possible fix is to use reset_index() to convert the MultiIndex to a normal index like the following:

    df['sum2_per_class'] = df[['class', 'val']].groupby('class').rolling(2).sum().reset_index(level=0, drop=True)
    

    However, after running the above code a few of the I was getting unexpected NaN values in the 'sum2_per_class' column as follows: [NaN, NaN, 5, NaN, 9, 11, 10, 9, 16] while other values are as expected.

    After investigating the NaN issues I came to the following conclusion:

    • The Rolling Sum operation requires at least two CONSECUTIVE rows within each group to calculate the sum. for example for the first group 'a' we have: 1) Row 0 with val1=1 and Row 7 with val=8 you expect the rolling sum to be 1 + 8 = 9 while these rows are not consecutive and will result in NaN. For other groups where we got the expected rolling sum the grouped rows are consecutive. For example for group 'c' we have: Row 3, Row 4, and Row 5.

    Update: To solve the NaN issues you can specify min_periods=1 in the Rolling function like below:

    df['sum2_per_class'] = df[['class', 'val']].groupby('class').rolling(2, min_periods=1).sum().reset_index(level=0, drop=True)