Tags: python, pandas, group-by

Python: How to groupby one column and then calculate a trailing mean and cumulative count where current start date is after prior end date


I want to group by one column (for example, 'country'). Each row has an associated 'start_date' and 'end_date'. For every row in a group, I want to increment a counter if the 'start_date' of the current row occurs after the 'end_date' of a prior row (and not increment otherwise), so the counter ends up being the number of prior rows that have already ended. I want the same logic to apply to the trailing mean. I have sorted by country and start_date.

For example, I have a dataframe that can be generated with the following code:

import pandas as pd
  
# create df
data = {'country': ['arg', 'arg', 'arg', 'arg', 'arg', 'usa', 'usa', 'usa'],
        'start_date': ['2020-01-01', '2020-01-01', '2020-05-01', '2021-05-01', '2021-07-01',
                       '2020-03-01', '2020-05-01', '2020-09-01'],
        'end_date': ['2020-10-01', '2020-09-01', '2021-01-01', '2021-06-01', '2021-12-01',
                     '2020-10-01', '2020-08-01', '2021-05-01'],
        'value': [250, 300, 150, 170, 200, 150, 100, 120]}
  
# Create DataFrame
df = pd.DataFrame(data)

And the desired result (with the new columns trailing_mean and counter) would be:

country  start_date  end_date    value  trailing_mean  counter
arg      2020-01-01  2020-10-01  250    NA             0
arg      2020-01-01  2020-09-01  300    NA             0
arg      2020-05-01  2021-01-01  150    NA             0
arg      2021-05-01  2021-06-01  170    233.33         3
arg      2021-07-01  2021-12-01  200    217.5          4
usa      2020-03-01  2020-10-01  150    NA             0
usa      2020-05-01  2020-08-01  100    NA             0
usa      2020-09-01  2021-05-01  120    100            1

Notice how the trailing_mean is NA until there are prior records whose end_date occurs before the current record's start_date. On every record, the trailing mean only takes into account past records that have already completed (their end_date falls before the current record's start_date). The counter follows the same logic: it starts at 0 and then increments. For arg it jumps from 0 to 3 because all three prior rows ended before that row started.
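
To make the rule concrete, here is a minimal sketch for a single row (the fourth arg row, starting 2021-05-01), written with plain Python rather than pandas. The names `prior`, `current_start` and `completed` are made up for illustration only.

```python
from datetime import date

# (end_date, value) for the three arg rows that precede the 2021-05-01 row
prior = [
    (date(2020, 10, 1), 250),
    (date(2020, 9, 1), 300),
    (date(2021, 1, 1), 150),
]
current_start = date(2021, 5, 1)

# keep only prior rows that ended strictly before the current row starts
completed = [value for end, value in prior if end < current_start]

counter = len(completed)
# no completed rows would mean there is nothing to average, hence None (NA)
trailing_mean = sum(completed) / counter if counter else None

print(counter, round(trailing_mean, 2))  # prints: 3 233.33
```

All three prior rows qualify here, which is why the counter goes straight from 0 to 3.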

I have tried to group by country and iterate through the rows, but I am having trouble accounting for the differences in end_dates. You can't just look back at the prior row; you have to look at all prior records, because the end_dates are not sequential.


Solution

  • IIUC, you can apply a custom function to generate your counts and trailing means for each group:

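    def count_and_avg(df):
        # one boolean mask per row: which rows of the group ended before this row's start_date
        # (the dates are ISO-formatted strings, so string comparison orders them chronologically)
        mask = [df['end_date'] < start for start in df['start_date']]
        df = df.assign(count=[sum(m) for m in mask],
                       # mean of 'value' over the completed rows; 0 when there are none
                       trailing_mean=[df[m]['value'].sum() / sum(m) if sum(m) else 0 for m in mask]
                       )
        return df
    
    # run the function on each country and flatten the resulting index
    out = df.groupby('country').apply(count_and_avg).reset_index(drop=True)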
    

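    Output for your sample data (here trailing_mean is 0 rather than NA for rows with no completed prior records, and the counter column is named count):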

      country  start_date    end_date  value  count  trailing_mean
    0     arg  2020-01-01  2020-10-01    250      0       0.000000
    1     arg  2020-01-01  2020-09-01    300      0       0.000000
    2     arg  2020-05-01  2021-01-01    150      0       0.000000
    3     arg  2021-05-01  2021-06-01    170      3     233.333333
    4     arg  2021-07-01  2021-12-01    200      4     217.500000
    5     usa  2020-03-01  2020-10-01    150      0       0.000000
    6     usa  2020-05-01  2020-08-01    100      0       0.000000
    7     usa  2020-09-01  2021-05-01    120      1     100.000000
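
The output above shows 0 where the question asked for NA. If NA is preferred, a variant along the following lines might work. This is only a sketch under some assumptions: it parses the dates with `pd.to_datetime` instead of comparing strings, builds the pairwise comparison with NumPy broadcasting, and loops over the groups explicitly rather than using `groupby(...).apply` (recent pandas versions warn when `apply` operates on the grouping column). The function name `count_and_mean` and the column name `counter` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

def count_and_mean(group):
    """Add counter and trailing_mean over rows that ended before each row starts."""
    start = pd.to_datetime(group['start_date']).to_numpy()
    end = pd.to_datetime(group['end_date']).to_numpy()
    values = group['value'].to_numpy(dtype=float)

    # completed[i, j] is True when row j ended strictly before row i starts
    completed = end[None, :] < start[:, None]

    counter = completed.sum(axis=1)
    totals = np.where(completed, values, 0.0).sum(axis=1)
    # NaN where no row has completed yet; np.maximum avoids dividing by zero
    trailing_mean = np.where(counter > 0, totals / np.maximum(counter, 1), np.nan)
    return group.assign(counter=counter, trailing_mean=trailing_mean)

df = pd.DataFrame({
    'country': ['arg', 'arg', 'arg', 'arg', 'arg', 'usa', 'usa', 'usa'],
    'start_date': ['2020-01-01', '2020-01-01', '2020-05-01', '2021-05-01', '2021-07-01',
                   '2020-03-01', '2020-05-01', '2020-09-01'],
    'end_date': ['2020-10-01', '2020-09-01', '2021-01-01', '2021-06-01', '2021-12-01',
                 '2020-10-01', '2020-08-01', '2021-05-01'],
    'value': [250, 300, 150, 170, 200, 150, 100, 120],
})

# process each country separately, then stitch the pieces back together
out = pd.concat(
    [count_and_mean(g) for _, g in df.groupby('country', sort=False)],
    ignore_index=True,
)
print(out)
```

The comparison matrix is n x n per group, so this trades memory for speed; that should be fine for groups of modest size, but very large groups may need a different approach.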