python statistics numpy numerical-methods

Methods for quickly calculating the standard deviation of a large set of numbers in NumPy


What's the best (fastest) way to do this?

Question

This generates what I believe is the correct answer, but obviously at N = 10e6 it is painfully slow. I think I need to keep the Xi values so I can correctly calculate the standard deviation, but are there any techniques to make this run faster?

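import numpy as np
from numpy import random

def randomInterval(a, b):
    # random.random() returns a float in [0, 1), so r lies in [a, b)
    r = (b - a) * random.random() + a
    return r

N = 10e6
Sum = 0
x = []
for sample in range(0, int(N)):
    n = randomInterval(-5., 5.)
    while n == 5.0:
        n = randomInterval(-5., 5.)  # since X is [-5,5)
    Sum += n
    x = np.append(x, n)  # copies the whole array on every call

A = Sum/N

summation = 0.0
for sample in range(0, int(N)):
    summation += (x[sample] - A)**2.0  # accumulate the squared deviations

standard_deviation = np.sqrt((1./N)*summation)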

Solution

  • You made a decent attempt. Since this is homework, make sure you understand the code below rather than copying it verbatim.

    import numpy as np
    N = int(1e6)
    a = np.random.uniform(-5,5,size=(N,))
    standard_deviation = np.std(a)
    

    This assumes you can use a package like NumPy (you tagged the question as such). If so, NumPy provides a whole host of functions for creating arrays and operating on them, which avoids explicit Python loops (the looping is done efficiently under the hood). It's worth looking through the documentation to see what features are available and how to use them:

    http://docs.scipy.org/doc/numpy/reference/index.html
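
As a rough illustration of the array operations mentioned above, the question's formula can be written without an explicit loop and checked against `np.std`. This is only a sketch: the seeded `rng` generator and the name `manual_std` are assumptions added for illustration, not part of the original answer. `np.std` defaults to `ddof=0`, so it divides by N, the same as the `1./N` factor in the question.

```python
import numpy as np

N = int(1e6)
rng = np.random.default_rng(0)      # assumption: seeded generator, for reproducibility only
a = rng.uniform(-5, 5, size=N)      # N samples drawn from [-5, 5)

mean = a.sum() / N                  # corresponds to A in the question
# sqrt((1/N) * sum((x_i - A)^2)), computed on the whole array at once
manual_std = np.sqrt(((a - mean) ** 2).sum() / N)

# np.std uses ddof=0 by default, so the two values agree up to rounding
print(manual_std, np.std(a))
```

For a uniform distribution on [-5, 5) the theoretical standard deviation is 10/sqrt(12), roughly 2.887, so both printed values should land close to that.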