I am trying the following code for estimating the variance in a sample, and compare it to the numpy.var implementation.
import numpy as np
def rcov(xj, (i, Mi, Si)):
j = i + 1
Mj = Mi + (xj - Mi) / j
Sj = Si + (i/j) * (xj - Mi) ** 2
return (j, Mj, Sj)
def mycov(X):
s = (0., 0, 0)
for i in xrange(len(X)):
s = rcov(X[i], s)
return s[-1] / (len(X) - 1) # sample covariance
X = np.random.rand(1000000)
N = 1e+15
print
print 'Sample (co)variance with 1 dof.'
print '-------------------------------'
print 'np.var(X) ', np.var(X, ddof=1)
print 'mycov(X) ', mycov(X)
print 'np.var(X+N) ', np.var(X+N, ddof=1)
print 'mycov(X+N) ', mycov(X+N)
And the output is
Sample (co)variance with 1 dof.
-------------------------------
np.var(X) 0.0833039106062
mycov(X) 0.0833039106062
np.var(X+N) 19208514.2744
mycov(X+N) 0.0859324294763
raising two questions:
For the sake of speed, NumPy does not check for arithmetic overflow or underflow. It's the user's responsibility to choose dtypes which are large enough to maintain the desired level of precision throughout all computations.
Using NumPy 1.8, and selecting X
to be of dtype longdouble
, rather than the default float64
,
X = np.random.rand(1000000).astype('longdouble')
yields
Sample (co)variance with 1 dof.
-------------------------------
np.var(X) 0.0832200737104
mycov(X) 0.0832200737104
np.var(X+N) 0.0832200805148
mycov(X+N) 0.0832199500372