Search code examples
pythonpandascategorical-datapandas-melt

Finding probabilities of each value in all categorical columns across a dataframe


My question is nearly identical to Finding frequency of each value in all categorical columns across a dataframe, but I need the probabilities, instead of the frequencies. We can use the same example dataframe:

df = pd.DataFrame(
    {'sub_code': ['CSE01', 'CSE01', 'CSE01', 'CSE02', 'CSE03', 'CSE04',
                  'CSE05', 'CSE06'],
     'stud_level': [101, 101, 101, 101, 101, 101, 101, 101],
     'grade': ['STA', 'STA', 'PSA', 'STA', 'STA', 'SSA', 'PSA', 'QSA']})

I tried adapting this answer in the following way:

out = (df.select_dtypes(object)
       .melt(var_name="Variable", value_name="Class")
       .value_counts(dropna=False, normalize=True)
       .reset_index(name="Probability")
       .sort_values(by=['Variable', 'Class'], ascending=[True, True])
       .reset_index(drop=True))

However, the code doesn't work, because the sum of the class probabilities for each variable is not 1. What am I doing wrong?


Solution

  • IIUC, you need to to groupby on 'Variable' after melting. This will give you the probabilities for each value in 'Class' in relation to each value in 'Variable'.

    out = (
        df.select_dtypes(object)
        .melt(var_name="Variable", value_name="Class")
        .groupby("Variable")  # add this line
        .value_counts(dropna=False, normalize=True)
        .reset_index(name="Probability")
        .sort_values(by=["Variable", "Class"], ascending=[True, True])
        .reset_index(drop=True)
    )
    
       Variable  Class  Probability
    0     grade    PSA        0.250
    1     grade    QSA        0.125
    2     grade    SSA        0.125
    3     grade    STA        0.500
    4  sub_code  CSE01        0.375
    5  sub_code  CSE02        0.125
    6  sub_code  CSE03        0.125
    7  sub_code  CSE04        0.125
    8  sub_code  CSE05        0.125
    9  sub_code  CSE06        0.125