I need to calculate the sample variance of a data set up to the n-th element, e.g.
x = np.random.randint(1, 7, 10)
--> [5 2 2 5 3 5 2 5 4 2]
The fast and easy way is to use np.var(x) or an implementation of Welford's algorithm, but those only calculate the variance for the whole data set. For my application I need the variance element-wise in an array, so that the n-th element holds the variance of the first n data points of the data set.
For example:
x_var[2]
--> variance of [5 2 2]
--> 3.0
x_var[9]
--> variance of [5 2 2 5 3 5 2 5 4 2]
--> 2.0555556
My solution is to slice the array into n arrays so that I can just use np.var on each of them for the running variance. This works but is incredibly slow.
x_var = np.empty(n)
for i in range(n):
    x_var[i] = np.var(x[:i + 1])  # variance of the first i + 1 elements
I already have a fast implementation of a running mean, so I have an array whose n-th entry is the mean of the first n elements, if that helps.
How would you solve this efficiently and accurately without slicing the array into n pieces?
A simple way is to use pandas with expanding() and var(ddof=0):
import numpy as np
import pandas as pd
x = np.array([5, 2, 2, 5, 3, 5, 2, 5, 4, 2])
pd.Series(x).expanding().var(ddof=0).to_numpy()
output:
array([0.        , 2.25      , 2.        , 2.25      , 1.84      ,
       1.88888889, 1.95918367, 1.984375  , 1.77777778, 1.85      ])
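If you want to stay in pure NumPy, the expanding variance can also be computed in one vectorized pass from cumulative sums, using var_n = E[x^2] - (E[x])^2 over the first n elements. This is a minimal sketch (the array x and the ddof=0 convention match the pandas answer above); note that the "mean of squares minus square of mean" formula can lose precision when the values are large relative to their spread, in which case a Welford-style recurrence is the safer choice:
import numpy as np

x = np.array([5, 2, 2, 5, 3, 5, 2, 5, 4, 2], dtype=float)

n = np.arange(1, len(x) + 1)               # 1, 2, ..., len(x)
running_mean = np.cumsum(x) / n            # mean of the first n elements
running_mean_sq = np.cumsum(x * x) / n     # mean of the squared elements
x_var = running_mean_sq - running_mean**2  # population (ddof=0) running variance
For the sample variance (ddof=1) instead, multiply by n / (n - 1), which is undefined for the first element.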