I want to groupby one column (for example, 'country'). Each row has an associated 'start_date' and 'end_date'. For every row in the groupby, I want to increment the counter if the 'start_date' in the current row occurs after the most recent 'end_date' in the prior rows (and not increment otherwise). I want the same logic to apply to the trailing mean. I have sorted by country and start_date.
For example, I have a dataframe that can be generated with the following code:
import pandas as pd
# create df
data = {'country': ['arg', 'arg', 'arg', 'arg', 'arg', 'usa', 'usa', 'usa'],
'start_date': ['2020-01-01', '2020-01-01', '2020-05-01', '2021-05-01', '2021-07-01',
'2020-03-01', '2020-05-01', '2020-09-01'],
'end_date': ['2020-10-01', '2020-09-01', '2021-01-01', '2021-06-01', '2021-12-01',
'2020-10-01', '2020-08-01', '2021-05-01'],
'value': [250, 300, 150, 170, 200, 150, 100, 120]}
# Create DataFrame
df = pd.DataFrame(data)
And the desired result (with the new columns trailing_mean and count) would be:
country | start_date | end_date | value | trailing_mean | counter |
---|---|---|---|---|---|
arg | 2020-01-01 | 2020-10-01 | 250 | NA | 0 |
arg | 2020-01-01 | 2020-09-01 | 300 | NA | 0 |
arg | 2020-05-01 | 2021-01-01 | 150 | NA | 0 |
arg | 2021-05-01 | 2021-06-01 | 170 | 233.33 | 3 |
arg | 2021-07-01 | 2021-12-01 | 200 | 217.5 | 4 |
usa | 2020-03-01 | 2020-10-01 | 150 | NA | 0 |
usa | 2020-05-01 | 2020-08-01 | 100 | NA | 0 |
usa | 2020-09-01 | 2021-05-01 | 120 | 100 | 1 |
Notice how the trailing_mean is NA until there are records that have a start_date that occurs AFTER the end_date. On every record, the trailing mean only takes into account past records that have already completed (their end_date happens before the current record's start_date). This is the same logic for the counter. It is 0 and then it increments. It jumps from 0 to 3 because all three prior rows ended before that row has started
I have tried to groupby country and iterate through the rows. But I am having trouble accounting for the differences in end_dates. You can't just look back at the prior row you have to look at all prior records because the end_dates are not sequential
IIUC, you can apply
a custom function to generate your counts and trailing means for each group:
def count_and_avg(df):
mask = [df['end_date'] < start for start in df['start_date']]
df = df.assign(count=[sum(m) for m in mask],
trailing_mean=[df[m]['value'].sum() / sum(m) if sum(m) else 0 for m in mask]
)
return df
out = df.groupby('country').apply(count_and_avg).reset_index(drop=True)
Output for your sample data:
country start_date end_date value count trailing_mean
0 arg 2020-01-01 2020-10-01 250 0 0.000000
1 arg 2020-01-01 2020-09-01 300 0 0.000000
2 arg 2020-05-01 2021-01-01 150 0 0.000000
3 arg 2021-05-01 2021-06-01 170 3 233.333333
4 arg 2021-07-01 2021-12-01 200 4 217.500000
5 usa 2020-03-01 2020-10-01 150 0 0.000000
6 usa 2020-05-01 2020-08-01 100 0 0.000000
7 usa 2020-09-01 2021-05-01 120 1 100.000000