Tags: python, pandas, correlation, pearson-correlation

Find high correlations in a large coefficient matrix


I have a dataset with 56 numerical features. After loading it into pandas, I can easily generate a correlation coefficient matrix.

However, because the matrix is large, I'd like to find the coefficients above or below a certain threshold, e.g. >0.8 or <-0.8, and list the corresponding pairs of variables. Is there a way to do this? I figure it would require selecting by value across all columns and then returning not the row itself but the column name and row index of each matching value, but I have no idea how to do either!

Thanks!


Solution

  • I think you can do this with where() and stack():

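    import numpy as np
    import pandas as pd

    np.random.seed(1)
    df = pd.DataFrame(np.random.rand(10,3))
    
    coeff = df.corr()
    
    # 0.3 is used for illustration 
    # replace with your actual value
    thresh = 0.3
    
    # True where |r| < thresh; use .gt(thresh) instead to keep the strong correlations
    mask = coeff.abs().lt(thresh)
    # or mask = coeff < thresh (signed comparison, no abs)
    
    # where() replaces cells outside the mask with NaN; stack() turns the rest
    # into (row, column) pairs; dropna() removes masked cells if stack() kept them
    coeff.where(mask).stack().dropna()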
    

    Output:

    0  2   -0.089326
    2  0   -0.089326
    dtype: float64
    

    For comparison, all off-diagonal coefficients in this example:

    0  1    0.319612
       2   -0.089326
    1  0    0.319612
       2   -0.687399
    2  0   -0.089326
       1   -0.687399
    dtype: float64
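
For the case in the question (coefficients above 0.8 or below -0.8), the same where() and stack() idea could be sketched roughly as follows. This is only an illustration under assumptions: the data is synthetic, the column names a to g are made up, and columns f and g are deliberately built to correlate strongly with a and b. Masking with the strict upper triangle is an extra step not in the snippet above; it drops the diagonal and lists each pair only once.

```python
import numpy as np
import pandas as pd

# Hypothetical data: five independent columns plus two planted relationships
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
df["f"] = 2 * df["a"] + rng.normal(scale=0.1, size=200)   # strongly positive with a
df["g"] = -df["b"] + rng.normal(scale=0.1, size=200)      # strongly negative with b

coeff = df.corr()
thresh = 0.8  # replace with your actual value

# Boolean mask for the strict upper triangle (k=1 excludes the diagonal),
# so each pair appears once and self-correlations are skipped
upper = np.triu(np.ones(coeff.shape, dtype=bool), k=1)

# Long format: MultiIndex of (row label, column label) -> coefficient
pairs = coeff.where(upper).stack()

# Keep pairs with |r| above the threshold; NaN entries compare False,
# so any masked cells that survived stack() are dropped here as well
high = pairs[pairs.abs() > thresh]
print(high)
```

The result is a Series indexed by (row label, column label), so each value in `high.index` is one pair of variable names, and `high.reset_index()` would give a three-column table if that is easier to read.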