How to calculate np.cov on a matrix with np.nan values without converting to pd.DataFrame?

I have the following np.array:

my_matrix = np.array([[1,np.nan,3], [np.nan,1,2], [np.nan,1,2]])
array([[ 1., nan,  3.],
       [nan,  1.,  2.],
       [nan,  1.,  2.]])

If I evaluate np.cov on it, I get:

array([[nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan]])

But if I were to calculate it with pd.DataFrame.cov I get a different result:

    0   1   2
0   NaN NaN NaN
1   NaN 0.0 0.000000
2   NaN 0.0 0.333333

I know from the pandas documentation that DataFrame.cov excludes NaN values.

My question is: how can I get the same (or a similar) result with NumPy? More generally, how should missing data be handled when calculating covariance with NumPy?


  • You can use NumPy's masked arrays (the numpy.ma module).

    import numpy.ma as ma
    cv = ma.cov(ma.masked_invalid(my_matrix), rowvar=False)
    masked_array(
      data=[[--, --, --],
            [--, 0.0, 0.0],
            [--, 0.0, 0.33333333333333337]],
      mask=[[ True,  True,  True],
            [ True, False, False],
            [ True, False, False]],
      fill_value=1e+20)

    To produce an ndarray with nan values filled in, use the filled method.

    cv.filled(np.nan)
    array([[       nan,        nan,        nan],
           [       nan, 0.        , 0.        ],
           [       nan, 0.        , 0.33333333]])

    Note that np.cov (like ma.cov) treats each row as a variable by default, so it produces pairwise row covariances. To replicate the pandas behavior (pairwise column covariances), you must pass rowvar=False to ma.cov.
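
For reference, here is a self-contained sketch that runs the masked-array approach end to end and compares it with a hand-written pairwise-complete covariance. The helper `pairwise_nan_cov` is a hypothetical name, not part of NumPy or pandas. It assumes that, for each pair of columns, only the rows where both values are present should be used, which is my understanding of what pandas does. As far as I can tell, ma.cov instead centers each column on the mean of all its unmasked values, so the two approaches agree on the matrix from the question but may differ on other data.

```python
import numpy as np
import numpy.ma as ma


def pairwise_nan_cov(a, ddof=1):
    """Column-wise covariance over pairwise-complete rows (hypothetical helper).

    For each pair of columns, only rows where both entries are non-NaN are
    used. Pairs with too few shared rows are left as NaN.
    """
    a = np.asarray(a, dtype=float)
    n_cols = a.shape[1]
    out = np.full((n_cols, n_cols), np.nan)
    valid = ~np.isnan(a)
    for i in range(n_cols):
        for j in range(i, n_cols):
            both = valid[:, i] & valid[:, j]
            n = int(both.sum())
            if n - ddof <= 0:
                continue  # not enough shared observations, keep NaN
            x = a[both, i]
            y = a[both, j]
            c = ((x - x.mean()) * (y - y.mean())).sum() / (n - ddof)
            out[i, j] = out[j, i] = c  # covariance matrix is symmetric
    return out


my_matrix = np.array([[1, np.nan, 3], [np.nan, 1, 2], [np.nan, 1, 2]])

# Masked-array approach from the answer, converted back to a plain ndarray.
masked = ma.cov(ma.masked_invalid(my_matrix), rowvar=False).filled(np.nan)

# Explicit pairwise-complete computation.
pairwise = pairwise_nan_cov(my_matrix)

print(masked)
print(pairwise)
```

The explicit loop is slower than ma.cov on wide matrices, but it makes the handling of missing values easy to inspect and adjust.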