python numpy pandas correlation

Pandas: How to drop self-correlation from a correlation matrix


I'm trying to find the highest correlations between different columns with pandas. I know I can get the correlation matrix with

df.corr()

I know I can get the highest correlations after that with

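corr = df.corr().stack()   # one value per (row, column) pair
corr = corr.sort_values()  # ascending order
corr[-5:]                  # the five highest correlations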

The problem is that these correlations also include each column's correlation with itself (always 1). How do I remove these self-correlation entries? I know I could drop every value equal to 1, but I don't want to do that because there might be genuine correlations of 1 between different columns.
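To make the concern concrete, here is a small sketch of why filtering on the value 1 is too aggressive. The DataFrame and its column names are made up for illustration: "b" is an exact multiple of "a", so their correlation is a genuine 1. The comparison uses np.isclose rather than == only because floating-point results are not guaranteed to be exactly 1.0.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is exactly 2 * "a", so corr(a, b) is a real 1.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
})

stacked = df.corr().stack()

# Dropping everything (approximately) equal to 1 removes the diagonal,
# but it also throws away the legitimate (a, b) and (b, a) entries.
by_value = stacked[~np.isclose(stacked, 1.0)]
```

After this filter, the pair ("a", "b") is gone along with the diagonal, which is exactly the loss the question wants to avoid.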


Solution

  • I recently found an even cleaner answer to my question: you can compare the multi-index levels by value.

    This is what I ended up using.

    corr = df.corr().stack()
    corr = corr[corr.index.get_level_values(0) != corr.index.get_level_values(1)]
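For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same index-level comparison. The sample DataFrame and the names first, second and top are assumptions for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: "b" is exactly 2 * "a", so corr(a, b) is a real 1.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

# Stack the matrix into a Series indexed by (row label, column label).
corr = df.corr().stack()

# Keep only entries whose two labels differ. This drops the diagonal
# by position in the index, not by value, so real 1.0 pairs survive.
first = corr.index.get_level_values(0)
second = corr.index.get_level_values(1)
corr = corr[first != second]

# Highest remaining correlations first.
top = corr.sort_values(ascending=False).head(2)
```

Because the correlation matrix is symmetric, each pair still appears twice, once as (a, b) and once as (b, a). The filter only removes the self-correlations.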