Now I have a table:
Score Customer ID my_dates Threshold Model_name is_alert
50 8 2017-08-05 50 Mod1 yes
50 9 2017-12-05 50 Mod1 yes
50 28 2017-05-22 50 Mod2 yes
50 28 2017-05-26 50 Mod2 yes
50 36 2017-06-20 50 Mod2 yes
If the score equal or exceed the threshold, is_alert will show 'yes'
Now I want to bin the date into the following formate and print how many alerts in each bin under each model but if within 7 days for one customer is alerted more than once, only the first hit contributes to the total score:
Model_name Jan-17 Feb-17 Mar-17 APR-17 May-17 Jun-17
Mod1
Mod2
Can somebody help me with that? Thanks
Use crosstab
with convert datetimes to month periods by Series.dt.to_period
, last convert to names of months by PeriodIndex.strftime
, but before get difference per groups by DataFrameGroupBy.diff
and filter rows with missing values (first rows per groups) and less or equal like 7
by Series.ge
and boolean indexing
:
df['my_dates'] = pd.to_datetime(df['my_dates'])
m = df['my_dates'].dt.to_period('m')
df['diff'] = df.groupby(['Model_name'])['my_dates'].diff().dt.days
print (df)
Score Customer ID my_dates Threshold Model_name is_alert diff
0 50 8 2017-08-05 50 Mod1 yes NaN
1 50 9 2017-12-05 50 Mod1 yes 122.0
2 50 28 2017-05-22 50 Mod2 yes NaN
3 50 28 2017-05-26 50 Mod2 yes 4.0
4 50 36 2017-06-20 50 Mod2 yes 25.0
df = df[df['diff'].ge(7) | df['diff'].isna()]
df1 = pd.crosstab(df['Model_name'], m)
df1.columns = df1.columns.strftime('%b-%y')
print (df1)
my_dates May-17 Jun-17 Aug-17 Dec-17
Model_name
Mod1 0 0 1 1
Mod2 1 1 0 0