python pandas correlation data-manipulation

How to remove duplicates from correlation in pandas?


I have a problem with the result of this code:

dataCorr = data.corr(method='pearson')
dataCorr = dataCorr[abs(dataCorr) >= 0.7].stack().reset_index()
dataCorr = dataCorr[dataCorr.level_0!=dataCorr.level_1]

From my correlation matrix:

dataCorr = data.corr(method='pearson')

I convert this matrix to long format (one row per pair of columns), keeping only correlations whose absolute value is at least 0.7:

dataCorr = dataCorr[abs(dataCorr) >= 0.7].stack().reset_index()

After that, I remove the diagonal of the matrix:

dataCorr = dataCorr[dataCorr.level_0!=dataCorr.level_1]

But I still have duplicate pairs, because each pair appears once in each order:

level_0             level_1             0
LiftPushSpeed       RT1EntranceSpeed    0.881714
RT1EntranceSpeed    LiftPushSpeed       0.881714

How can I avoid this problem?


Solution

  • You can convert the lower triangle of the matrix (diagonal included) to NaNs, which stack then removes:

    import numpy as np
    import pandas as pd

    np.random.seed(12)
    
    data = pd.DataFrame(np.random.randint(20, size=(5,6)))
    print (data)
        0   1   2  3   4   5
    0  11   6  17  2   3   3
    1  12  16  17  5  13   2
    2  11  10   0  8  12  13
    3  18   3   4  3   1   0
    4  18  18  16  6  13   9
    
    dataCorr = data.corr(method='pearson')
    # hide the lower triangle and the diagonal (mask sets True cells to NaN)
    dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(bool))
    print (dataCorr)
        0         1         2         3         4         5
    0 NaN  0.042609 -0.041656 -0.113998 -0.173011 -0.201122
    1 NaN       NaN  0.486901  0.567216  0.914260  0.403469
    2 NaN       NaN       NaN -0.412853  0.157747 -0.354012
    3 NaN       NaN       NaN       NaN  0.823628  0.858918
    4 NaN       NaN       NaN       NaN       NaN  0.635730
    5 NaN       NaN       NaN       NaN       NaN       NaN
    
    # for your data, change the 0.5 threshold to 0.7
    dataCorr = dataCorr[abs(dataCorr) >= 0.5].stack().reset_index()
    print (dataCorr)
       level_0  level_1         0
    0        1        3  0.567216
    1        1        4  0.914260
    2        3        4  0.823628
    3        3        5  0.858918
    4        4        5  0.635730
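
The steps above can be wrapped into one reusable function. This is only a minimal sketch: the function name `high_corr_pairs`, the column labels `var_1`, `var_2` and `corr`, and the demo data are all made up for illustration. The extra `dropna()` is a precaution, assuming that some pandas versions may keep NaN rows when stacking.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.7, method="pearson"):
    """Return each strongly correlated column pair once (hypothetical helper)."""
    corr = df.corr(method=method)
    # True on and below the diagonal: these cells are hidden (set to NaN).
    lower = np.tril(np.ones(corr.shape, dtype=bool))
    upper = corr.mask(lower)
    # Keep only strong correlations, then go from wide to long format.
    pairs = upper[upper.abs() >= threshold].stack().dropna()
    # Give the two index levels and the value column readable names.
    pairs = pairs.rename_axis(["var_1", "var_2"]).reset_index(name="corr")
    return pairs


# Made-up data: column "b" is an exact linear function of column "a".
demo = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                     "b": [2, 4, 6, 8, 10],
                     "c": [5, 1, 4, 2, 3]})
print(high_corr_pairs(demo))
```

With named columns such as LiftPushSpeed and RT1EntranceSpeed, the pair would likewise be listed only once, in the order in which the columns appear in the frame.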
    

    Detail (the mask, computed from the shape of the 6x6 correlation matrix before it is stacked):

    print (np.tril(np.ones(dataCorr.shape)))
    [[ 1.  0.  0.  0.  0.  0.]
     [ 1.  1.  0.  0.  0.  0.]
     [ 1.  1.  1.  0.  0.  0.]
     [ 1.  1.  1.  1.  0.  0.]
     [ 1.  1.  1.  1.  1.  0.]
     [ 1.  1.  1.  1.  1.  1.]]
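
As a further illustration of the mask, the sketch below compares the lower-triangle mask used above with its complement, the strict upper triangle built with `np.triu(..., k=1)`. The 4x4 size and the constant matrix `sym` are arbitrary stand-ins for a real correlation matrix, and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

n = 4
# Cells to hide: lower triangle including the diagonal (as in the answer).
hide = np.tril(np.ones((n, n), dtype=bool))
# Cells to keep: strict upper triangle; k=1 starts one step above the diagonal.
keep = np.triu(np.ones((n, n), dtype=bool), k=1)

# Check whether the two masks are exact complements of each other.
same = np.array_equal(keep, ~hide)

# Made-up symmetric matrix standing in for a correlation matrix.
sym = pd.DataFrame(np.full((n, n), 0.5))
masked_a = sym.mask(hide)   # NaN where hide is True
masked_b = sym.where(keep)  # NaN where keep is False
print(same, masked_a.equals(masked_b))
```

Either form leaves n * (n - 1) / 2 cells, one per unordered pair of columns, which is why each pair shows up only once after stacking.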