I have monthly customer records for a particular period (say from 2000-10 till 2005-03 inclusive). Most customers have one record per month for each of the months in this period. This would be, in total, 54 records for this period.
However, some have some of this data missing.
I would like to produce stats showing how many customers have
missing records for this period. How to do that?
I am including a tiny subset of the data. It shows that customer 2 doesn't have a record for 2001-02.
df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
'period' : [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
'volume' : [1,2,3,4,5,6,7,8,9,10,12],
'num_transactions': [3,4,5,6,7,8,9,10,11,12,13]})
period = df.period.unique() # OR pd.period_range('2000-10', '2001-03', freq='M')
df_missing_percent = 1 - df.cust_id.value_counts() / period.size
or if you want to account for null values in existing records too:
df_missing_percent = df.pivot(columns='period', index='cust_id').T.isna().mean()
then:
sum(df_missing_percent == 0) # 1
sum(df_missing_percent < 0.01) # 1
sum(df_missing_percent < 0.05) # 1
sum(df_missing_percent < 0.10) # 1
sum(df_missing_percent > 0.10) # 1