Search code examples
pythonpandaspython-datetimepython-dateutil

Bin the date into month


Now I have a table:

Score  Customer ID      my_dates        Threshold Model_name   is_alert
50         8           2017-08-05         50     Mod1          yes
50         9           2017-12-05         50     Mod1          yes
50         28          2017-05-22         50     Mod2          yes
50         28          2017-05-26         50     Mod2          yes
50         36          2017-06-20         50     Mod2          yes

If the score equal or exceed the threshold, is_alert will show 'yes'

Now I want to bin the date into the following formate and print how many alerts in each bin under each model but if within 7 days for one customer is alerted more than once, only the first hit contributes to the total score:

Model_name   Jan-17    Feb-17    Mar-17    APR-17   May-17   Jun-17 
Mod1                                                           
Mod2                                                           

Can somebody help me with that? Thanks


Solution

  • Use crosstab with convert datetimes to month periods by Series.dt.to_period, last convert to names of months by PeriodIndex.strftime, but before get difference per groups by DataFrameGroupBy.diff and filter rows with missing values (first rows per groups) and less or equal like 7 by Series.ge and boolean indexing:

    df['my_dates'] = pd.to_datetime(df['my_dates'])
    
    m = df['my_dates'].dt.to_period('m')
    df['diff'] = df.groupby(['Model_name'])['my_dates'].diff().dt.days
    print (df)
       Score  Customer ID   my_dates  Threshold Model_name is_alert   diff
    0     50            8 2017-08-05         50       Mod1      yes    NaN
    1     50            9 2017-12-05         50       Mod1      yes  122.0
    2     50           28 2017-05-22         50       Mod2      yes    NaN
    3     50           28 2017-05-26         50       Mod2      yes    4.0
    4     50           36 2017-06-20         50       Mod2      yes   25.0
    
    df = df[df['diff'].ge(7) | df['diff'].isna()]
    df1 = pd.crosstab(df['Model_name'], m)
    df1.columns = df1.columns.strftime('%b-%y')
    print (df1)
    my_dates    May-17  Jun-17  Aug-17  Dec-17
    Model_name                                
    Mod1             0       0       1       1
    Mod2             1       1       0       0