Search code examples
pythonpandascategorical-datadata-wrangling

Descriptive statistics for categorical variables in Python Pandas


I have a dataframe of categorical variables in Python Pandas:

import pandas as pd
cat_df = pd.DataFrame({'item':['bed',  'lamp', 'candle', 'chair', 'bed',  'candle', 'lamp'],
                  'location' :['home', 'home', 'home',   'home',  'home', 'home',   'home' ],
                  'status'   :['new',  'used', 'used',   'new',   'new',  'used',   'new' ]})

cat_df = cat_df.astype('category')
print(cat_df.dtypes)

cat_df.describe().transpose()

and I need an output that shows all the values of categorical variables plus the frequency and percentage

# item     bed    2 28.57
# item     lamp   2 28.57
# item     candle 2 28.57
# item     chair  1 14.28
# location home   7 100
# status   new    4 57.14
# status   used   3 42.85

The real dataset has a lot of variables so I need a scalable solution.


Solution

  • Use DataFrame.melt with GroupBy.size for counts and then divide by sum to new column by GroupBy.transform:

    df1 = cat_df.melt(value_name='v', var_name='c')
    df = df1.groupby(['c', 'v']).size().reset_index(name='count')
    df['perc'] = df['count'].div(df.groupby('c')['count'].transform('sum')).mul(100)
    print (df)
              c       v  count        perc
    0      item     bed      2   28.571429
    1      item  candle      2   28.571429
    2      item   chair      1   14.285714
    3      item    lamp      2   28.571429
    4  location    home      7  100.000000
    5    status     new      4   57.142857
    6    status    used      3   42.857143