
How to get the number of elements used for each pair in a pandas DataFrame correlation


I have:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, np.nan, 435, 546],
                   'B': [10, 2, 3, 4, 867, 23],
                   'C': [4, 5, np.nan, np.nan, np.nan, 64]})
df


    A       B   C
0   1.0     10  4.0
1   2.0     2   5.0
2   3.0     3   NaN
3   NaN     4   NaN
4   435.0   867 NaN
5   546.0   23  64.0

I compute the correlation with df.corr(), which returns the correlation matrix. According to the documentation, the correlation excludes NaNs pairwise, so correlation(A, B) is computed from 5 values, while correlation(A, C) is computed from only 3.
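As a quick check of that pairwise behavior, here is a minimal sketch that compares the (A, C) entry of df.corr() with the correlation computed only on the rows where both A and C are present. The names r_matrix, complete and r_complete are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, np.nan, 435, 546],
                   'B': [10, 2, 3, 4, 867, 23],
                   'C': [4, 5, np.nan, np.nan, np.nan, 64]})

# Correlation of A and C as reported by the full correlation matrix.
r_matrix = df.corr().loc['A', 'C']

# Same pair, computed only on the rows where both A and C are non-NaN.
complete = df[['A', 'C']].dropna()
r_complete = complete['A'].corr(complete['C'])

print(len(complete))                     # number of rows used for the (A, C) pair
print(np.isclose(r_matrix, r_complete))  # whether the two values agree
```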

I ran this to get the number of elements based on each pairing.

for i in range(df.shape[1]):
  for j in range(df.shape[1]):
    if j==i: continue
    print(df.columns[i],df.columns[j],df.iloc[:,np.r_[i,j]].dropna().shape)

A B (5, 2)
A C (3, 2)
B A (5, 2)
B C (3, 2)
C A (3, 2)
C B (3, 2)

How can I transform this so that I get a matrix similar to the one returned by df.corr()?

    A           B           C
A   1.000000    0.508726    0.999916
B   0.508726    1.000000    0.920458
C   0.999916    0.920458    1.000000

Solution

  • Are you looking for the number of common non-NaN values for each pair of columns? If so:

    s = df.notna().astype(int)
    
    s.T @ s
    

    Output:

       A  B  C
    A  5  5  3
    B  5  6  3
    C  3  3  3
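
Why this works, as a self-contained sketch: s holds 1 where a value is present and 0 where it is NaN, so entry (i, j) of s.T @ s sums s[i] * s[j] over the rows, which counts the rows where both columns are non-NaN. The names counts and expected are introduced here for illustration, and the brute-force comparison is only a sanity check, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, np.nan, 435, 546],
                   'B': [10, 2, 3, 4, 867, 23],
                   'C': [4, 5, np.nan, np.nan, np.nan, 64]})

# 1 where a value is present, 0 where it is NaN.
s = df.notna().astype(int)

# Entry (i, j) counts the rows where columns i and j are both non-NaN.
counts = s.T @ s

# Brute-force check: count the jointly non-NaN rows for every pair of columns.
expected = pd.DataFrame(
    [[(df[a].notna() & df[b].notna()).sum() for b in df.columns]
     for a in df.columns],
    index=df.columns, columns=df.columns)

print(counts)
print((counts.values == expected.values).all())
```

The diagonal holds each column's own non-NaN count, and the off-diagonal entries line up with the corresponding cells of df.corr().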