Search code examples
pythonpandasgroup-by

Binning based on percentage missing data


I have monthly customer records for a particular period (say from 2000-10 till 2005-03 inclusive). Most customers have one record per month for each of the months in this period. This would be, in total, 54 records for this period.

However, some have some of this data missing.

I would like to produce stats showing how many customers have

  • 0%
  • less than 1%
  • less than 5%
  • less than 10%
  • more than 10%

missing records for this period. How to do that?

I am including a tiny subset of the data. It shows that customer 2 doesn't have a record for 2001-02.

df=pd.DataFrame({'cust_id': [1,1,1,1,1,1,2,2,2,2,2],
                 'period' : [200010,200011,200012,200101,200102,200103,200010,200011,200012,200101,200103],
                 'volume' : [1,2,3,4,5,6,7,8,9,10,12],
                 'num_transactions': [3,4,5,6,7,8,9,10,11,12,13]})

Solution

  • period = df.period.unique() # OR pd.period_range('2000-10', '2001-03', freq='M')
    df_missing_percent = 1 - df.cust_id.value_counts() / period.size
    

    or if you want to account for null values in existing records too:

    df_missing_percent = df.pivot(columns='period', index='cust_id').T.isna().mean()
    

    then:

    sum(df_missing_percent == 0)    # 1
    sum(df_missing_percent < 0.01)  # 1
    sum(df_missing_percent < 0.05)  # 1
    sum(df_missing_percent < 0.10)  # 1
    sum(df_missing_percent > 0.10)  # 1