I have a dataframe of categorical variables in Python Pandas:
import pandas as pd
cat_df = pd.DataFrame({'item':['bed', 'lamp', 'candle', 'chair', 'bed', 'candle', 'lamp'],
'location' :['home', 'home', 'home', 'home', 'home', 'home', 'home' ],
'status' :['new', 'used', 'used', 'new', 'new', 'used', 'new' ]})
cat_df = cat_df.astype('category')
print(cat_df.dtypes)
cat_df.describe().transpose()
and I need an output that shows all the values of categorical variables plus the frequency and percentage
# item bed 2 28.57
# item lamp 2 28.57
# item candle 2 28.57
# item chair 1 14.28
# location home 7 100
# status new 4 57.14
# status used 3 42.85
The real dataset has a lot of variables so I need a scalable solution.
Use DataFrame.melt
with GroupBy.size
for counts and then divide by sum
to new column by GroupBy.transform
:
df1 = cat_df.melt(value_name='v', var_name='c')
df = df1.groupby(['c', 'v']).size().reset_index(name='count')
df['perc'] = df['count'].div(df.groupby('c')['count'].transform('sum')).mul(100)
print (df)
c v count perc
0 item bed 2 28.571429
1 item candle 2 28.571429
2 item chair 1 14.285714
3 item lamp 2 28.571429
4 location home 7 100.000000
5 status new 4 57.142857
6 status used 3 42.857143