
Correlation among multiple categorical variables



I have a data set made of 22 categorical variables (non-ordered). I would like to visualize their correlation in a nice heatmap. Since the Pandas built-in function

DataFrame.corr(method='pearson', min_periods=1)

only implements correlation coefficients for numerical variables (Pearson, Kendall, Spearman), I have to aggregate it myself to perform a chi-square test or something similar, and I am not sure which function to use to do it in one elegant step (rather than iterating through all the cat1*cat2 pairs). To be clear, this is what I would like to end up with (a dataframe):

         cat1  cat2  cat3  
  cat1|  coef  coef  coef  
  cat2|  coef  coef  coef
  cat3|  coef  coef  coef

Any ideas with pd.pivot_table or something in the same vein?


Solution

  • You can use pd.factorize to encode each column as integer codes and then call corr on the result:

    df.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1)
    Out[32]: 
         a    c    d
    a  1.0  1.0  1.0
    c  1.0  1.0  1.0
    d  1.0  1.0  1.0
    

    Data input

    df=pd.DataFrame({'a':['a','b','c'],'c':['a','b','c'],'d':['a','b','c']})
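
One thing worth checking before relying on the factorize approach: pd.factorize assigns integer codes in order of first appearance, so the Pearson value computed on those codes can change when the labels are numbered differently, even though the categories themselves are unordered. The sketch below illustrates this on a made-up frame (the names toy, x and y are hypothetical, not from the question), where y is a pure relabelling of x:

```python
import pandas as pd

# Hypothetical toy columns: y is a one-to-one relabelling of x,
# so as categories the two are perfectly associated.
toy = pd.DataFrame({
    "x": ["a", "b", "c", "a", "b", "c"],
    "y": ["p", "r", "q", "p", "r", "q"],
})

# Default behaviour: codes follow the order of first appearance in each column.
codes = toy.apply(lambda s: pd.factorize(s)[0])
r_appearance = codes.corr(method="pearson").loc["x", "y"]

# sort=True numbers the labels in sorted order instead.
codes_sorted = toy.apply(lambda s: pd.factorize(s, sort=True)[0])
r_sorted = codes_sorted.corr(method="pearson").loc["x", "y"]

print(r_appearance, r_sorted)
```

The relationship between the two columns is identical in both runs and only the numbering of the labels differs, so the coefficient here reflects the encoding as much as the data.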
    

    Update

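    import pandas as pd
    from scipy.stats import chisquare
    
    # Encode each column as integer codes starting at 1; chisquare divides by f_exp, so zeros must be avoided
    df=df.apply(lambda x : pd.factorize(x)[0])+1
    
    # For each column, treat its codes as observed values and every column's codes as expected values
    pd.DataFrame([chisquare(df[x].values,f_exp=df.values.T,axis=1)[0] for x in df])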
    
    Out[123]: 
         0    1    2    3
    0  0.0  0.0  0.0  0.0
    1  0.0  0.0  0.0  0.0
    2  0.0  0.0  0.0  0.0
    3  0.0  0.0  0.0  0.0
    
    Data input for the update

    df=pd.DataFrame({'a':['a','d','c'],'c':['a','b','c'],'d':['a','b','c'],'e':['a','b','c']})
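
For unordered categories, a chi-square-based measure computed on the contingency table of each pair may be closer to what the question asks for. The following is a minimal sketch, not the answer's method: it swaps in Cramér's V (without bias correction), built from pd.crosstab and scipy.stats.chi2_contingency. The helper names cramers_v and cramers_v_matrix and the toy data are assumptions made for illustration. It still loops over column pairs internally, but returns the coef-by-coef dataframe in one call:

```python
import itertools

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V between two categorical Series (no bias correction)."""
    table = pd.crosstab(x, y)
    # correction=False disables Yates' continuity correction on 2x2 tables.
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    # A constant column gives k == 0, in which case V is undefined.
    return np.sqrt(chi2 / (n * k)) if k > 0 else np.nan


def cramers_v_matrix(df):
    """Symmetric DataFrame of pairwise Cramér's V over the columns of df."""
    cols = df.columns
    # The diagonal is set to 1 by convention (assumes no constant columns).
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for a, b in itertools.combinations(cols, 2):
        out.loc[a, b] = out.loc[b, a] = cramers_v(df[a], df[b])
    return out


# Hypothetical data: y is a relabelling of x, z is balanced against both.
toy = pd.DataFrame({
    "x": ["a", "a", "b", "b"],
    "y": ["p", "p", "q", "q"],
    "z": ["u", "v", "u", "v"],
})
result = cramers_v_matrix(toy)
print(result)
```

The values lie between 0 (no association) and 1 (one variable determines the other) and do not depend on how the labels are numbered. The resulting frame can be passed to a plotting function such as seaborn.heatmap. With very few rows per category the chi-square statistic is unreliable, so treat the output on tiny samples with caution.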