I have the following np.array:
import numpy as np
my_matrix = np.array([[1, np.nan, 3], [np.nan, 1, 2], [np.nan, 1, 2]])
array([[ 1., nan,  3.],
       [nan,  1.,  2.],
       [nan,  1.,  2.]])
If I evaluate np.cov on it, I get:
np.cov(my_matrix)
array([[nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan]])
But if I calculate it with pd.DataFrame.cov, I get a different result:
pd.DataFrame(my_matrix).cov()
     0    1         2
0  NaN  NaN       NaN
1  NaN  0.0  0.000000
2  NaN  0.0  0.333333
I know that, according to the pandas documentation, DataFrame.cov handles nan values.
My question is: how can I get the same (or a similar) result with numpy? Or, more generally, how should I handle missing data when calculating covariance with numpy?
You can use NumPy's masked arrays: mask the nan entries with ma.masked_invalid, then compute the covariance with ma.cov.
import numpy.ma as ma
cv = ma.cov(ma.masked_invalid(my_matrix), rowvar=False)
cv
masked_array(
  data=[[--, --, --],
        [--, 0.0, 0.0],
        [--, 0.0, 0.33333333333333337]],
  mask=[[ True,  True,  True],
        [ True, False, False],
        [ True, False, False]],
  fill_value=1e+20)
To produce an ndarray with nan values filled in, use the filled method.
cv.filled(np.nan)
array([[       nan,        nan,        nan],
       [       nan, 0.        , 0.        ],
       [       nan, 0.        , 0.33333333]])
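For comparison, the same idea can be sketched in plain NumPy without masked arrays. This is only an illustration, not how either library implements it: the helper name pairwise_cov is hypothetical, and it assumes that "handling nan" means using, for each pair of columns, only the rows where both values are present, with the usual n - 1 denominator.

```python
import numpy as np

def pairwise_cov(data, ddof=1):
    """Column covariances over pairwise-complete rows (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    n_cols = data.shape[1]
    valid = ~np.isnan(data)
    # Start from all-NaN; pairs with too few shared rows stay NaN.
    out = np.full((n_cols, n_cols), np.nan)
    for i in range(n_cols):
        for j in range(i, n_cols):
            both = valid[:, i] & valid[:, j]  # rows where both columns are present
            n = both.sum()
            if n - ddof <= 0:
                continue
            xi = data[both, i]
            xj = data[both, j]
            cov_ij = ((xi - xi.mean()) * (xj - xj.mean())).sum() / (n - ddof)
            out[i, j] = out[j, i] = cov_ij
    return out

my_matrix = np.array([[1, np.nan, 3], [np.nan, 1, 2], [np.nan, 1, 2]])
result = pairwise_cov(my_matrix)
print(result)
```

On this matrix the sketch gives nan for every entry involving column 0 (it has a single observation), 0 for the entries involving column 1, and 1/3 for the variance of column 2. The double loop is fine for a handful of columns but would be slow for wide data.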
Note that np.cov and ma.cov treat each row as a variable by default, so they produce pairwise row covariances. To replicate pandas' behavior (pairwise column covariances), you must pass rowvar=False to ma.cov.
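A small check of the orientation point, as a sketch using the same my_matrix as in the question: passing rowvar=False should give the same result as transposing the data and keeping the default, while the default on the untransposed data gives row covariances instead. The variable names here are illustrative only.

```python
import numpy as np
import numpy.ma as ma

my_matrix = np.array([[1, np.nan, 3], [np.nan, 1, 2], [np.nan, 1, 2]])
masked = ma.masked_invalid(my_matrix)

# Each column is a variable (the pandas-like orientation).
by_columns = ma.cov(masked, rowvar=False).filled(np.nan)

# Transposing and keeping the default (each row is a variable) is equivalent.
by_transpose = ma.cov(masked.T).filled(np.nan)

# The default on the original data treats each row as a variable instead.
by_rows = ma.cov(masked).filled(np.nan)

print(np.allclose(by_columns, by_transpose, equal_nan=True))
print(by_rows)
```

If the data is already laid out with one variable per row, the default call needs no rowvar argument at all.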