python pandas matrix dataframe adjacency-matrix

Most efficient way to create a non-redundant correlation matrix in Python?


I feel like numpy, scipy, or networkx has a method to do this but I just haven't figured it out yet.

My question is: how can I create a nonredundant correlation matrix, in the form of a DataFrame, from a redundant correlation matrix for LARGE DATASETS in the MOST EFFICIENT way (in Python)?

I'm using this method on a 7000x7000 matrix and it's taking forever on my MacBook Air with 4 GB of RAM (I know, I definitely shouldn't use this for programming, but that's another discussion).

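Example of a redundant correlation matrix: the full symmetric matrix, with every pair stored twice. [image]

Example of a nonredundant correlation matrix: only the lower triangle and the diagonal are kept, with zeros above the diagonal. [image]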

I give a pretty naive way of doing it below, but there has to be a better way. I like keeping my matrices in sparse form and converting them to DataFrames for storage purposes.

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

#Example DataFrame
L_test = [[0.999999999999999,
  0.374449352805868,
  0.000347439531148995,
  0.00103026903356954,
  0.0011830950375467401],
 [0.374449352805868,
  1.0,
  1.17392596672424e-05,
  1.49428208843456e-07,
  1.216664263989e-06],
 [0.000347439531148995,
  1.17392596672424e-05,
  1.0,
  0.17452569907144502,
  0.238497202355299],
 [0.00103026903356954,
  1.49428208843456e-07,
  0.17452569907144502,
  1.0,
  0.7557000865939779],
 [0.0011830950375467401,
  1.216664263989e-06,
  0.238497202355299,
  0.7557000865939779,
  1.0]]
labels = ['AF001', 'AF002', 'AF003', 'AF004', 'AF005']
DF_1 = pd.DataFrame(L_test,columns=labels,index=labels)

#Create Nonredundant Similarity Matrix
n,m = DF_1.shape #they will be the same since it's an adjacency matrix
#Empty array to fill
A_tmp = np.zeros((n,m))
#Copy the lower triangle (diagonal included), one row at a time
for i in range(n):
    for j in range(m):
        A_tmp[i,j] = DF_1.iloc[i,j]
        if j==i:
            break
#Make array sparse for storage
A_csr = csr_matrix(A_tmp)
#Recreate DataFrame
DF_2 = pd.DataFrame(A_csr.toarray(),columns=DF_1.columns,index=DF_1.index)
DF_2.head()

Solution

  • I think you can create a mask array with np.tril and then multiply it by the DataFrame DF_1:

    print(np.tril(np.ones(DF_1.shape)))
    [[ 1.  0.  0.  0.  0.]
     [ 1.  1.  0.  0.  0.]
     [ 1.  1.  1.  0.  0.]
     [ 1.  1.  1.  1.  0.]
     [ 1.  1.  1.  1.  1.]]
    
    print(np.tril(np.ones(DF_1.shape)) * DF_1)
              AF001         AF002     AF003   AF004  AF005
    AF001  1.000000  0.000000e+00  0.000000  0.0000      0
    AF002  0.374449  1.000000e+00  0.000000  0.0000      0
    AF003  0.000347  1.173926e-05  1.000000  0.0000      0
    AF004  0.001030  1.494282e-07  0.174526  1.0000      0
    AF005  0.001183  1.216664e-06  0.238497  0.7557      1
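
A possible variation on the mask approach, offered as a sketch rather than a benchmarked answer: np.tril can be applied to the DataFrame's values directly, which skips building the array of ones and the element-wise multiply. The helper names lower_triangle and lower_triangle_sparse are made up for this illustration, and the 3x3 values are invented rather than taken from the question's data.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def lower_triangle(df, keep_diagonal=True):
    """Return a copy of a square DataFrame with entries above the diagonal zeroed."""
    # k=0 keeps the diagonal, k=-1 drops it as well
    k = 0 if keep_diagonal else -1
    return pd.DataFrame(np.tril(df.to_numpy(), k=k),
                        index=df.index, columns=df.columns)


def lower_triangle_sparse(df):
    """Return the lower triangle (diagonal included) as a SciPy CSR matrix."""
    return sparse.csr_matrix(np.tril(df.to_numpy()))


# Small symmetric example with made-up values (not the question's data)
labels = ["a", "b", "c"]
full = pd.DataFrame(
    [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.3],
     [0.2, 0.3, 1.0]],
    index=labels, columns=labels,
)

nonredundant = lower_triangle(full)     # DataFrame, zeros above the diagonal
stored = lower_triangle_sparse(full)    # CSR matrix holding only the nonzero entries
```

For scale, a dense 7000x7000 float64 array takes roughly 390 MB, and np.ones produces float64 by default, so avoiding the extra mask array may matter on a machine with 4 GB of RAM. Whether this is actually faster on a given setup is something to measure rather than assume.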