Search code examples
pythonpandasdataframenumpyseries

Find event and non-event rate using pandas


I have a dataframe like as shown below

import numpy as np
import pandas as pd
np.random.seed(100)


df = pd.DataFrame({'grade': np.random.choice(list('ABCD'),size=(20)),
                   'dash': np.random.choice(list('PQRS'),size=(20)),
                   'dumeel': np.random.choice(list('QWER'),size=(20)),
                   'dumma': np.random.choice((1234),size=(20)),
                   'target': np.random.choice([0,1],size=(20))
})

I would like to do the below

a) event rate - Compute the % occurrence of 1s (from target column) for each unique value in a each of the input categorical column

b) non event rate - Compute the % occurrence of 0s (from target column) for each unique value in each of the input categorical columns

I tried the below

input_category_columns = df.select_dtypes(include='object')
df_rate_calc = pd.DataFrame()
for ip in input_category_columns:
    feature,target = ip,'target'
    df_rate_calc['col_name'] = (pd.crosstab(df[feature],df[target],normalize='columns'))

I would like to do this on a million rows and if there is any efficient approach, would really be helpful

I expect my output to be like as shown below. I have shown for only two columns but I want to produce this output for all categorical columns

enter image description here


Solution

  • Here is one approach:

    • Select the catgorical columns (cols)
    • Melt the dataframe with target as id variable and cols as value variables
    • Group the dataframe and use value_counts to calculate frequency
    • Unstack to reshape the dataframe
    cols = df.select_dtypes('object')
    df_out = (
        df.melt('target', cols)
          .groupby(['variable', 'target'])['value']
          .value_counts(normalize=True)
          .unstack(1, fill_value=0)
    )
    

    print(df_out)
    
    target            0    1
    variable value          
    dash     P      0.4  0.3
             Q      0.2  0.3
             R      0.2  0.1
             S      0.2  0.3
    dumeel   E      0.2  0.2
             Q      0.1  0.0
             R      0.4  0.6
             W      0.3  0.2
    grade    A      0.4  0.2
             B      0.0  0.2
             C      0.4  0.3
             D      0.2  0.3