Tags: python, numpy, variance

How to efficiently calculate the variance or standard deviation given a counter of numbers?


Given, for example, the following histogram (counts) and bins (values):

import numpy as np
hist = np.array([1, 1, 2, 1, 2])
bins = np.array([0, 1, 2, 3, 4])

What is the most efficient way to calculate the variance? One way would be to expand the histogram back into the full array of values and pass it to the np.var function:

import numpy as np
np.var(np.array([0, 1, 2, 2, 3, 4, 4]))

However, I think this is not very efficient, since it rebuilds the full array of values just to compute a single statistic.
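For reference, the expansion described above does not have to be typed by hand. A minimal sketch of that baseline, assuming hist holds integer counts and bins holds the corresponding values (the names expanded and baseline_var are only illustrative):

```python
import numpy as np

hist = np.array([1, 1, 2, 1, 2])
bins = np.array([0, 1, 2, 3, 4])

# Repeat each bin value by its count: [0, 1, 2, 2, 3, 4, 4]
expanded = np.repeat(bins, hist)
baseline_var = np.var(expanded)

# The expanded array has hist.sum() elements, so its size grows with
# the total count rather than with the number of distinct values.
print(expanded.size, hist.sum())
```

This is the cost that a formula working directly on hist and bins would avoid.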


Solution

  • You can rewrite the variance formula to work directly on the counts, using Var(X) = E[X^2] - (E[X])^2:

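    counts = hist.sum()                    # total number of values
    mean = (hist * bins).sum() / counts    # weighted mean, E[X]
    
    sum_squares = (bins**2 * hist).sum()   # weighted sum of x**2
    var = sum_squares / counts - mean**2   # E[X**2] - E[X]**2
    std = np.sqrt(var)                     # standard deviation
    
    # test: compare against np.var on the expanded array
    np.isclose(var, np.var(np.repeat(bins, hist)))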
    

    Output: True.
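The same idea can be wrapped in a small function that also returns the standard deviation. The sketch below is one possible variant, not the answer's exact method: it uses the two-pass form (weighted mean first, then weighted squared deviations), which tends to be less prone to cancellation error than E[X^2] - (E[X])^2 when the mean is large relative to the spread. The name hist_var_std and the ddof parameter are hypothetical, chosen here to mirror np.var.

```python
import numpy as np

def hist_var_std(bins, hist, ddof=0):
    """Variance and standard deviation of values `bins`, each occurring `hist` times.

    Hypothetical helper for illustration; not part of NumPy.
    """
    bins = np.asarray(bins, dtype=float)
    hist = np.asarray(hist, dtype=float)
    n = hist.sum()                              # total number of values
    mean = np.average(bins, weights=hist)       # weighted mean
    # Weighted sum of squared deviations from the mean (two-pass form)
    ss = (hist * (bins - mean) ** 2).sum()
    var = ss / (n - ddof)                       # ddof=0 matches np.var's default
    return var, np.sqrt(var)

hist = np.array([1, 1, 2, 1, 2])
bins = np.array([0, 1, 2, 3, 4])

var, std = hist_var_std(bins, hist)
expanded = np.repeat(bins, hist)                # only built here to check the result
print(np.isclose(var, np.var(expanded)), np.isclose(std, np.std(expanded)))
```

Passing ddof=1 gives the sample variance, which should agree with np.var(expanded, ddof=1) when hist contains integer counts.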