Tags: python, arrays, numpy, statistics, variance

Is there a fast (and accurate) way to calculate the sample variance of a dataset up to the n-th element?


I need to calculate the sample variance of a dataset up to the n-th element, e.g.

x = np.random.randint(1, 7, 10)
--> [5 2 2 5 3 5 2 5 4 2]

The fast and easy way is to use np.var(x) or an implementation of Welford's algorithm, but those only calculate the variance for the whole dataset. For my application I need the variance element-wise in an array, so that the n-th element holds the variance of the first n data points of the dataset.

For example:

x_var[2]
--> variance of [5 2 2]
--> 3.0
x_var[9]
--> variance of [5 2 2 5 3 5 2 5 4 2]
--> 2.0555556

My solution is to slice the array into n arrays so that I can just use np.var on each of them for the running variance. This works but is incredibly slow.

n = len(x)
x_var = np.zeros(n)
for i in range(n):
    x_var[i] = np.var(x[:i + 1])  # variance of the first i + 1 elements

I already have a fast implementation of a running mean, so I have an array whose n-th entry is the mean of the first n elements, if that helps.

How would you solve this efficiently and accurately without slicing the array into n pieces?


Solution

  • A simple way is to use pandas with expanding() and var(ddof=0), which gives the running population variance:

    import numpy as np
    import pandas as pd
    
    x = np.array([5, 2, 2, 5, 3, 5, 2, 5, 4, 2])
    
    pd.Series(x).expanding().var(ddof=0).to_numpy()  # entry i is the variance of x[:i + 1]
    

    output:

    array([0.        , 2.25      , 2.        , 2.25      , 1.84      ,
           1.88888889, 1.95918367, 1.984375  , 1.77777778, 1.85      ])
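
  • If you would rather stay in NumPy, the running variance can be built from cumulative sums, which also reuses the running mean you already have. This is a minimal sketch (the variable names are only illustrative), based on the identity var = E[x²] − E[x]²:

    import numpy as np

    x = np.array([5, 2, 2, 5, 3, 5, 2, 5, 4, 2], dtype=float)

    n = np.arange(1, len(x) + 1)                # prefix lengths 1..len(x)
    running_mean = np.cumsum(x) / n             # the running mean you already compute
    running_mean_sq = np.cumsum(x**2) / n       # running mean of the squares
    x_var = running_mean_sq - running_mean**2   # population variance (ddof=0)

    This reproduces the pandas output above; multiplying by n / (n - 1) would give the ddof=1 sample variance (the first entry is then undefined). Be aware that the E[x²] − E[x]² form can lose precision when the variance is tiny relative to the mean.

  • If numerical stability matters, Welford's algorithm can be adapted to record the variance after every element instead of only at the end. A sketch (running_variance_welford is a made-up helper name, not a library function):

    import numpy as np

    def running_variance_welford(x, ddof=0):
        # Welford's online algorithm, emitting the variance after each element.
        x = np.asarray(x, dtype=float)
        out = np.empty(len(x))
        mean = 0.0
        m2 = 0.0  # running sum of squared deviations from the current mean
        for i, value in enumerate(x):
            delta = value - mean
            mean += delta / (i + 1)
            m2 += delta * (value - mean)
            denom = (i + 1) - ddof
            out[i] = m2 / denom if denom > 0 else np.nan
        return out

    running_variance_welford(x)          # matches the ddof=0 output above
    running_variance_welford(x, ddof=1)  # sample variance; first entry is NaN

    The Python loop is slower than the vectorized versions, but it is single-pass, O(n), and numerically stable.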