Tags: python, python-3.x, pandas, numpy, covariance

How to calculate np.cov on a matrix with np.nan values without converting to pd.DataFrame?


I have the following np.array:

my_matrix = np.array([[1,np.nan,3], [np.nan,1,2], [np.nan,1,2]])
array([[ 1., nan,  3.],
       [nan,  1.,  2.],
       [nan,  1.,  2.]])

If I evaluate np.cov on it, I get:

np.cov(my_matrix)
array([[nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan]])

But if I calculate it with pd.DataFrame.cov, I get a different result:

pd.DataFrame(my_matrix).cov()
    0   1   2
0   NaN NaN NaN
1   NaN 0.0 0.000000
2   NaN 0.0 0.333333

I know that, per the pandas documentation, DataFrame.cov excludes NaN values when computing the covariance.

My question is: how can I get the same (or a similar) result with numpy? More generally, how should I handle missing data when calculating covariance with numpy?


Solution

  • You can use NumPy's masked arrays.

    import numpy as np
    import numpy.ma as ma

    # Mask NaN/inf entries, then compute covariances between columns
    cv = ma.cov(ma.masked_invalid(my_matrix), rowvar=False)
    cv
    
    masked_array(
      data=[[--, --, --],
            [--, 0.0, 0.0],
            [--, 0.0, 0.33333333333333337]],
      mask=[[ True,  True,  True],
            [ True, False, False],
            [ True, False, False]],
      fill_value=1e+20)
    

    To get a plain ndarray with NaN in place of the masked entries, use the filled method.

    cv.filled(np.nan)
    
    array([[       nan,        nan,        nan],
           [       nan, 0.        , 0.        ],
           [       nan, 0.        , 0.33333333]])
    

    Note that np.cov and ma.cov treat each row as a variable by default (rowvar=True), so they return covariances between rows. To replicate the pandas behavior (pairwise covariances between columns), you must pass rowvar=False to ma.cov.
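
If you would rather stay with plain ndarrays, the pairwise idea can also be sketched by hand. The helper below, pairwise_cov, is a hypothetical name and not part of NumPy or pandas. It treats columns as variables (like rowvar=False) and, for each pair of columns, uses only the rows where both values are present. It assumes a 2-D float input and leaves NaN wherever too few rows are jointly observed. For the matrix in the question it gives the same values as the pandas output shown there; that is a check on this one example, not a guarantee of identical results in every case.

```python
import numpy as np

def pairwise_cov(a, ddof=1):
    """Covariance between columns of `a`, using for each pair of columns
    only the rows in which both values are present (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    n_vars = a.shape[1]
    out = np.full((n_vars, n_vars), np.nan)
    present = ~np.isnan(a)

    for i in range(n_vars):
        for j in range(i, n_vars):
            both = present[:, i] & present[:, j]
            n = both.sum()
            if n - ddof <= 0:
                continue  # too few jointly observed rows: leave NaN
            x = a[both, i]
            y = a[both, j]
            c = ((x - x.mean()) * (y - y.mean())).sum() / (n - ddof)
            out[i, j] = out[j, i] = c  # covariance matrix is symmetric
    return out

my_matrix = np.array([[1, np.nan, 3],
                      [np.nan, 1, 2],
                      [np.nan, 1, 2]])

result = pairwise_cov(my_matrix)
print(result)
```

Column 0 has only one observed value, so every entry involving it stays NaN. The explicit double loop keeps the sketch readable but is slow for many columns; the masked-array approach above is the vectorized route.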