Search code examples
pythonpandasdataframenumpycategorical-data

Finding frequency of each value in all categorical columns across a dataframe


I have a dataframe like as shown below

df = pd.DataFrame(
    {'sub_code' : ['CSE01', 'CSE01', 'CSE01', 
                   'CSE02', 'CSE03', 'CSE04',
                   'CSE05', 'CSE06'],
     'stud_level' : [101, 101, 101, 101, 
                  101, 101, 101, 101],
     'grade' : ['STA','STA','PSA','STA','STA','SSA','PSA','QSA']})

I would like to do the below

a) get the frequency of each unique value in all categorical columns of a dataframe

I tried the below but it is neither efficient nor elegant

df['sub_code'].value_counts() # need to key in column name manually
df['grade'].value_counts() # need to key in column name manually
df.select_dtypes(include='object').value_counts() #produces incorrect output

As my real data has more than 200 columns and 100k rows, is there any efficient approach to do this?

I expect my output to be like as shown below. If there is any other better way to show the below output, I welcome that as well. I don't know how else to capture this information in a neat manner. So, please do share your suggestions

enter image description here


Solution

  • Use melt and value_counts:

    out = (df.select_dtypes(object)
           .melt(var_name="Column", value_name="Value")
           .value_counts(dropna=False)
           .reset_index(name="Frequency")
           .sort_values(by=['Column','Frequency','Variable'], ascending=[True,False,True])
           .reset_index(drop=True))
    

    Output:

         Column Variable  Frequency
    0     grade      STA          4
    1     grade      PSA          2
    2     grade      QSA          1
    3     grade      SSA          1
    4  sub_code    CSE01          3
    5  sub_code    CSE02          1
    6  sub_code    CSE03          1
    7  sub_code    CSE04          1
    8  sub_code    CSE05          1
    9  sub_code    CSE06          1