Python: How to groupby one column and then calculate a trailing mean and cumulative count where current start date is after prior end date

I want to groupby one column (for example, 'country'). Each row has an associated 'start_date' and 'end_date'. For every row in the groupby, I want to increment the counter if the 'start_date' in the current row occurs after the most recent 'end_date' in the prior rows (and not increment otherwise). I want the same logic to apply to the trailing mean. I have sorted by country and start_date.

For example, I have a dataframe that can be generated with the following code:

import pandas as pd
  
# create df
data = {'country': ['arg', 'arg', 'arg', 'arg', 'arg', 'usa', 'usa', 'usa'],
        'start_date': ['2020-01-01', '2020-01-01', '2020-05-01', '2021-05-01', '2021-07-01',
                       '2020-03-01', '2020-05-01', '2020-09-01'],
        'end_date': ['2020-10-01', '2020-09-01', '2021-01-01', '2021-06-01', '2021-12-01',
                     '2020-10-01', '2020-08-01', '2021-05-01'],
        'value': [250, 300, 150, 170, 200, 150, 100, 120]}
  
# Create DataFrame
df = pd.DataFrame(data)

And the desired result (with the new columns trailing_mean and count) would be:

country	start_date	end_date	value	trailing_mean	counter
arg	2020-01-01	2020-10-01	250	NA	0
arg	2020-01-01	2020-09-01	300	NA	0
arg	2020-05-01	2021-01-01	150	NA	0
arg	2021-05-01	2021-06-01	170	233.33	3
arg	2021-07-01	2021-12-01	200	217.5	4
usa	2020-03-01	2020-10-01	150	NA	0
usa	2020-05-01	2020-08-01	100	NA	0
usa	2020-09-01	2021-05-01	120	100	1

Notice how the trailing_mean is NA until there are records that have a start_date that occurs AFTER the end_date. On every record, the trailing mean only takes into account past records that have already completed (their end_date happens before the current record's start_date). This is the same logic for the counter. It is 0 and then it increments. It jumps from 0 to 3 because all three prior rows ended before that row has started

I have tried to groupby country and iterate through the rows. But I am having trouble accounting for the differences in end_dates. You can't just look back at the prior row you have to look at all prior records because the end_dates are not sequential

Solution

IIUC, you can apply a custom function to generate your counts and trailing means for each group:

def count_and_avg(df):
    mask = [df['end_date'] < start for start in df['start_date']]
    df = df.assign(count=[sum(m) for m in mask],
                   trailing_mean=[df[m]['value'].sum() / sum(m) if sum(m) else 0 for m in mask]
                   )
    return df

out = df.groupby('country').apply(count_and_avg).reset_index(drop=True)

Output for your sample data:

  country  start_date    end_date  value  count  trailing_mean
0     arg  2020-01-01  2020-10-01    250      0       0.000000
1     arg  2020-01-01  2020-09-01    300      0       0.000000
2     arg  2020-05-01  2021-01-01    150      0       0.000000
3     arg  2021-05-01  2021-06-01    170      3     233.333333
4     arg  2021-07-01  2021-12-01    200      4     217.500000
5     usa  2020-03-01  2020-10-01    150      0       0.000000
6     usa  2020-05-01  2020-08-01    100      0       0.000000
7     usa  2020-09-01  2021-05-01    120      1     100.000000