python python-3.x pandas numpy covariance

How to calculate np.cov on a matrix with np.nan values without converting to pd.DataFrame?

I have the following np.array:

my_matrix = np.array([[1,np.nan,3], [np.nan,1,2], [np.nan,1,2]])

array([[ 1., nan,  3.],
       [nan,  1.,  2.],
       [nan,  1.,  2.]])

If I evaluate np.cov on it, I get:

np.cov(my_matrix)

array([[nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan]])

But if I calculate it with pd.DataFrame.cov, I get a different result:

pd.DataFrame(my_matrix).cov()

     0    1         2
0  NaN  NaN       NaN
1  NaN  0.0  0.000000
2  NaN  0.0  0.333333

I know that, per the pandas documentation, DataFrame.cov excludes NaN values when computing the pairwise covariance of columns.

My question is: how can I get the same (or a similar) result with NumPy? More generally, how should I handle missing data when calculating covariance with NumPy?

Solution

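You can use NumPy's masked arrays: ma.masked_invalid masks the NaN (and inf) entries, and ma.cov then computes the covariance from the unmasked values only.

import numpy as np
import numpy.ma as ma

# my_matrix is the array defined in the question
# rowvar=False treats each column as a variable, as pandas does
cv = ma.cov(ma.masked_invalid(my_matrix), rowvar=False)
cv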

masked_array(
  data=[[--, --, --],
        [--, 0.0, 0.0],
        [--, 0.0, 0.33333333333333337]],
  mask=[[ True,  True,  True],
        [ True, False, False],
        [ True, False, False]],
  fill_value=1e+20)

To get a plain ndarray with NaN in place of the masked entries, use the filled method.

cv.filled(np.nan)

array([[       nan,        nan,        nan],
       [       nan, 0.        , 0.        ],
       [       nan, 0.        , 0.33333333]])

Note that np.cov and ma.cov treat each row as a variable by default (rowvar=True), so they return row-by-row covariances. To replicate the pandas behavior, where each column is a variable, you must pass rowvar=False to ma.cov.
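
If you need to match pandas more closely on other inputs, it may help to compute the pairwise-complete covariance by hand. As far as I understand, ma.cov centers each column using all of its unmasked values, while pandas uses only the rows where both columns of a pair are present, so the two can disagree even though they agree on this example. The following is a minimal sketch in plain NumPy, not a library function; the name pairwise_nan_cov and its ddof parameter are my own, hypothetical choices.

```python
import numpy as np

def pairwise_nan_cov(a, ddof=1):
    """Column-by-column covariance using, for each pair of columns,
    only the rows where both values are non-NaN.
    (Hypothetical helper for illustration, not a NumPy API.)"""
    a = np.asarray(a, dtype=float)
    n_cols = a.shape[1]
    out = np.full((n_cols, n_cols), np.nan)
    valid = ~np.isnan(a)
    for i in range(n_cols):
        for j in range(i, n_cols):
            both = valid[:, i] & valid[:, j]   # rows usable for this pair
            n = both.sum()
            if n - ddof <= 0:
                continue                        # too few observations: leave NaN
            x = a[both, i]
            y = a[both, j]
            c = ((x - x.mean()) * (y - y.mean())).sum() / (n - ddof)
            out[i, j] = out[j, i] = c           # covariance matrix is symmetric
    return out

my_matrix = np.array([[1, np.nan, 3], [np.nan, 1, 2], [np.nan, 1, 2]])
result = pairwise_nan_cov(my_matrix)
print(result)
```

On my_matrix this leaves every entry involving column 0 as NaN (it has a single observation), gives 0 for the entries involving only column 1 or the pair (1, 2), and 1/3 for the variance of column 2, which is what pd.DataFrame(my_matrix).cov() shows above. The double loop is fine for a handful of columns; for wide matrices a vectorized version would be preferable.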