Search code examples
pandasmoving-average

Moving Average Pandas Across Group


My data has the following structure:

np.random.seed(25)
tdf = pd.DataFrame({'person_id' :[1,1,1,1,
                                  2,2,
                                  3,3,3,3,3,
                                  4,4,4,
                                  5,5,5,5,5,5,5,
                                  6,
                                  7,7,
                                  8,8,8,8,8,8,8,
                                  9,9,
                                  10,10
                                  ],
                    'Date': ['2021-01-02','2021-01-05','2021-01-07','2021-01-09',
                             '2021-01-02','2021-01-05',
                             '2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11',
                             '2021-01-02','2021-01-05','2021-01-07',
                             '2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
                             '2021-01-02',
                              '2021-01-02','2021-01-05',
                              '2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
                              '2021-01-02','2021-01-05',
                              '2021-01-02','2021-01-05'
                             ],
                    'Quantity': np.floor(np.random.random(size=35)*100)
                    })

And I want to calculate moving average (2 periods) over Date. So, the final output looks like the following. For first MA, we are taking 2021-01-02 & 2021-01-05 across all observations & calculate the MA (50). Similarly for other dates. The output need not be in the structure I'm showing the report. I just need date & MA column in the final data. enter image description here

Thanks!


Solution

  • IIUC, you can aggregate the similar dates first, getting the sum and count.

    Then take the sum per rolling 2 dates (here it doesn't look like you want to take care of a defined period but rather raw successive values, so I am assuming here prior sorting).

    Finally, perform the ratio of sum and count to get the mean:

    g = tdf.groupby('Date')['Quantity']
    out = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
    

    output:

    Date
    2021-01-02          NaN
    2021-01-05    50.210526
    2021-01-07    45.071429
    2021-01-09    41.000000
    2021-01-11    44.571429
    2021-01-13    48.800000
    2021-01-15    50.500000
    Name: Quantity, dtype: float64
    
    joining the original data:
    g = tdf.groupby('Date')['Quantity']
    s = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
    tdf.merge(s.rename('Quantity_MA(2)'), left_on='Date', right_index=True)
    

    output:

        person_id       Date  Quantity  Quantity_MA(2)
    0           1 2021-01-02      87.0             NaN
    4           2 2021-01-02      41.0             NaN
    6           3 2021-01-02      68.0             NaN
    11          4 2021-01-02      11.0             NaN
    14          5 2021-01-02      16.0             NaN
    21          6 2021-01-02      51.0             NaN
    22          7 2021-01-02      38.0             NaN
    24          8 2021-01-02      51.0             NaN
    31          9 2021-01-02      90.0             NaN
    33         10 2021-01-02      45.0             NaN
    1           1 2021-01-05      58.0       50.210526
    5           2 2021-01-05      11.0       50.210526
    7           3 2021-01-05      43.0       50.210526
    12          4 2021-01-05      44.0       50.210526
    15          5 2021-01-05      52.0       50.210526
    23          7 2021-01-05      99.0       50.210526
    25          8 2021-01-05      55.0       50.210526
    32          9 2021-01-05      66.0       50.210526
    34         10 2021-01-05      28.0       50.210526
    2           1 2021-01-07      27.0       45.071429
    8           3 2021-01-07      55.0       45.071429
    13          4 2021-01-07      58.0       45.071429
    16          5 2021-01-07      32.0       45.071429
    26          8 2021-01-07       3.0       45.071429
    3           1 2021-01-09      18.0       41.000000
    9           3 2021-01-09      36.0       41.000000
    17          5 2021-01-09      69.0       41.000000
    27          8 2021-01-09      71.0       41.000000
    10          3 2021-01-11      40.0       44.571429
    18          5 2021-01-11      36.0       44.571429
    28          8 2021-01-11      42.0       44.571429
    19          5 2021-01-13      83.0       48.800000
    29          8 2021-01-13      43.0       48.800000
    20          5 2021-01-15      48.0       50.500000
    30          8 2021-01-15      28.0       50.500000