Search code examples
pandasdatedate-range

How to produce monthly count when given a date range in pandas?


I have a dataframe that records users, a label, and the start and end date of them being labelled as such e.g.

user label start_date end_date
1 x 2018-01-01 2018-10-01
2 x 2019-05-10 2020-01-01
3 y 2019-04-01 2022-04-20
1 b 2018-10-01 2020-05-08

etc

where each row is for a given user and a label; a user appears multiple times for different labels

I want to get a count of users for every month for each label, such as this:

date count_label_x count_label_y count_label_b count_label_
2018-01 10 0 20 5
2018-02 2 5 15 3
2018-03 20 6 8 3

etc

where for instance for the first entry of the previous table, that user should be counted once for every month between his start and end date. The problem boils down to this and since I only have a few labels I can filter labels one by one and produce one output for each label. But how do I check and count users given an interval?

Thanks


Solution

  • You can use date_range combined with to_period to generate the active months, then pivot_table with aggfunc='nunique' to aggregate the unique user (if you want to count the duplicated users use aggfunc='count'):

    out = (df
       .assign(period=[pd.date_range(a, b, freq='M').to_period('M')
                       for a,b in zip(df['start_date'], df['end_date'])])
       .explode('period')
       .pivot_table(index='period', columns='label', values='user',
                    aggfunc='nunique', fill_value=0)
    )
    

    output:

    label    b  x  y
    period          
    2018-01  0  1  0
    2018-02  0  1  0
    2018-03  0  1  0
    2018-04  0  1  0
    2018-05  0  1  0
    ...
    2021-12  0  0  1
    2022-01  0  0  1
    2022-02  0  0  1
    2022-03  0  0  1
    

    handling NaT

    if you have the same start/end and want to count the value:

    out = (df
       .assign(period=[pd.date_range(a, b, freq='M').to_period('M')
                       for a,b in zip(df['start_date'], df['end_date'])])
       .explode('period')
       .assign(period=lambda d: d['period'].fillna(d['start_date'].dt.to_period('M')))
       .pivot_table(index='period', columns='label', values='user',
                    aggfunc='nunique', fill_value=0)
    )