python numpy statistics numerical-methods

Welford Variance Differs from Numpy Variance

I want to use Welford's method to compute a running variance and mean. I came across this implementation of Welford's method in Python. However, when testing to double-check that it results in the same output as the standard Numpy implementation of calculating variance, I do find that there is a difference in output.

Running the following code (using the python module unittest) shows that the two give different results (even after testing many times):

random_sample = np.random.normal(0, 1, 100)
std = np.var(random_sample, dtype=np.longdouble)
mean = np.mean(random_sample, dtype=np.longdouble)
welford = Welford()
welford.add_all(random_sample)

self.assertAlmostEqual(mean, welford.mean)
self.assertAlmostEqual(var, welford.var_s)

>> AssertionError: 1.1782075496578717837 != 1.1901086360180526 within 7 places (0.011901086360180828804 difference)

Interestingly, there is only a difference in the variance, not the mean.

For my purposes, a 0.012 difference is significant enough that it could affect my results.

Why would there be such a difference? Could this be due to compounding floating point errors? If so, would my best bet be to rewrite the package to use the Decimal class?

Solution

By default, np.var computes the so-called "population variance", in which the number of degrees of freedom is equal to the number of elements in the array.

wellford.var_s is the sample variance, in which the number of degrees of freedom is the number of elements in the array minus one.

To eliminate the discrepancy, pass ddof=1 to np.var:

import numpy as np
from welford import Welford
random_sample = np.random.normal(0, 1, 100)
var = np.var(random_sample, dtype=np.longdouble, ddof=1)
welford = Welford()
welford.add_all(random_sample)
np.testing.assert_allclose(var, welford.var_s)

Alternatively, if it is appropriate to use the population variance in your application, use welford.var_p.

var = np.var(random_sample, dtype=np.longdouble)
np.testing.assert_allclose(var, welford.var_p)

For a description of the difference between the two, see the development version of the np.var documentation.