I have a dataframe that records users, a label, and the start and end date of them being labelled as such e.g.
user | label | start_date | end_date |
---|---|---|---|
1 | x | 2018-01-01 | 2018-10-01 |
2 | x | 2019-05-10 | 2020-01-01 |
3 | y | 2019-04-01 | 2022-04-20 |
1 | b | 2018-10-01 | 2020-05-08 |
etc
where each row is for a given user and a label; a user appears multiple times for different labels
I want to get a count of users for every month for each label, such as this:
date | count_label_x | count_label_y | count_label_b | count_label_ |
---|---|---|---|---|
2018-01 | 10 | 0 | 20 | 5 |
2018-02 | 2 | 5 | 15 | 3 |
2018-03 | 20 | 6 | 8 | 3 |
etc
where for instance for the first entry of the previous table, that user should be counted once for every month between his start and end date. The problem boils down to this and since I only have a few labels I can filter labels one by one and produce one output for each label. But how do I check and count users given an interval?
Thanks
You can use date_range
combined with to_period
to generate the active months, then pivot_table
with aggfunc='nunique'
to aggregate the unique user (if you want to count the duplicated users use aggfunc='count'
):
out = (df
.assign(period=[pd.date_range(a, b, freq='M').to_period('M')
for a,b in zip(df['start_date'], df['end_date'])])
.explode('period')
.pivot_table(index='period', columns='label', values='user',
aggfunc='nunique', fill_value=0)
)
output:
label b x y
period
2018-01 0 1 0
2018-02 0 1 0
2018-03 0 1 0
2018-04 0 1 0
2018-05 0 1 0
...
2021-12 0 0 1
2022-01 0 0 1
2022-02 0 0 1
2022-03 0 0 1
if you have the same start/end and want to count the value:
out = (df
.assign(period=[pd.date_range(a, b, freq='M').to_period('M')
for a,b in zip(df['start_date'], df['end_date'])])
.explode('period')
.assign(period=lambda d: d['period'].fillna(d['start_date'].dt.to_period('M')))
.pivot_table(index='period', columns='label', values='user',
aggfunc='nunique', fill_value=0)
)