I have this dataset, when there isn't enough data in the 3 or 6 months window
, the service registers the entry as NaN
.
Dataset:
df = pd.DataFrame({'customer':['C1000','C1000','C1000','C1000','C1000','C1000','C1000','C1000','C2000','C2000','C2000','C2000'],
'account': ['A1100','A1100','A1100','A1200','A1200','A1300','A1300','A1300','A2100','A2100','A2100','A2100'],
'month': ['2019-10-01','2019-11-01','2019-12-01','2019-10-01','2019-11-01','2019-10-01','2019-11-01','2019-12-01','2019-09-01','2019-10-01','2019-11-01','2019-12-01'],
'invoice': [34000,55000,80000,90000,55000,10000,10000,20000,45000,78000,55000,80000]
})
The expected result is like this:
Note the NaN
value when we don't have 3 full months of data.
+--------+-------+--------------------------+
|customer|account|avg_invoices_last_3_months|
+--------+-------+--------------------------+
|C1000 |A1100 |41,333 |
|C1000 |A1200 |NaN |
...
I tried this transformation:
# Count how many rows has in each month
df['cnt_month'] = df.groupby(['customer','account']).transform('count')
# At this point, both columns receive NaN value, but the invoice column can't
# be change
df.loc[df.cnt_month < 4] = 'NaN'
# Here, I need to group by customer and account for invoice values
# also NaN and numeric in the grouping like in the "The expected result"
df.groupby(['customer','account','month'])['invoice'].mean()
But, the result isn't working.
I think you could substitute 'NaN', and then tally it with 'invoice'. Does it meet the intent of the question?
df.loc[df.cnt_month < 4, 'invoice'] = np.nan
df.groupby(['customer','account'])['invoice'].agg(avg_invoices_last_3_months=('invoice','mean')).reset_index()
customer account avg_invoices_last_3_months
0 C1000 A1100 NaN
1 C1000 A1200 NaN
2 C1000 A1300 NaN
3 C2000 A2100 64500.0