I have very little knowledge of statistics, so forgive me, but I'm very confused by how the numpy function std
works, and the documentation is unfortunately not clearing it up.
From what I understand, it computes the standard deviation of a distribution from an array of samples. But when I set up a Gaussian with a standard deviation of 0.5
using the following code, numpy.std
returns about 0.2:
import numpy as np
import matplotlib.pyplot as plt

sigma = 0.5
mu = 1
x = np.linspace(0, 2, 100)
f = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp((-1 / 2) * ((x - mu) / sigma)**2)
plt.plot(x, f)
plt.show()
print(np.std(f))
This is the distribution:
I have no idea what I'm misunderstanding about how the function works. I thought maybe I would have to tell it the x-values associated with the y-values of the distribution, but there's no argument for that in the function. Why is numpy.std not returning the actual standard deviation of my distribution?
I suspect that you understand perfectly well how the function works, but are misunderstanding the meaning of your data. Standard deviation is a measure of the spread of data about the mean value.
When you say std(f)
, you are computing the spread of the y-values about their mean. Looking at the graph in the question, a vertical mean of ~0.5 and a standard deviation of ~0.2 are not far-fetched. Notice that std(f)
does not involve the x-values in any way.
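To see this concretely, here is a small sketch (reusing the setup from the question) showing that np.std(f) is nothing more than the root-mean-square deviation of the y-values from their mean, with x never entering the computation:

```python
import numpy as np

sigma = 0.5
mu = 1
x = np.linspace(0, 2, 100)
f = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma)**2)

# np.std treats f as a flat sample of numbers; x plays no role
manual = np.sqrt(np.mean((f - f.mean())**2))
print(np.std(f), manual)  # identical values, ~0.2 as reported in the question
```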
What you are expecting to get is the standard deviation of the x-values, weighted by the y-values. This is essentially the idea behind a probability density function (PDF).
Let's go through the computation manually to understand the differences. The mean of the x-values is normally x.sum() / x.size
. But that is only true when the weight of each value is 1. If you weight each value by the corresponding f
value, you can write
m = (x * f).sum() / f.sum()
Standard deviation is the root-mean-square deviation from the mean: compute the average squared deviation from the mean, then take the square root. We can compute the weighted mean of the squared deviations in exactly the same way as before:
s = np.sqrt(np.sum((x - m)**2 * f) / f.sum())
Notice that the value of s
computed this way from the data in your question is not 0.5, but rather ~0.44. This is because your PDF is incomplete: the x-range [0, 2] cuts off the Gaussian's tails, which would otherwise have added significantly to the spread.
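As an illustrative check (not part of the original answer), you can quantify how incomplete the PDF is by integrating it over the sampled x-range; a full Gaussian PDF would integrate to 1, but [0, 2] only covers mu ± 2 sigma:

```python
import numpy as np

sigma = 0.5
mu = 1
x = np.linspace(0, 2, 100)
f = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma)**2)

# trapezoidal estimate of the area under the sampled PDF
area = (((f[:-1] + f[1:]) / 2) * np.diff(x)).sum()
print(area)  # ~0.95: roughly 5% of the probability mass sits in the cut-off tails
```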
Here is an example showing that the standard deviation converges to the expected value as you compute it for a larger sample of the PDF:
>>> def s(x, y):
... m = (x * y).sum() / y.sum()
... return np.sqrt(np.sum((x - m)**2 * y) / y.sum())
>>> sigma = 0.5
>>> x1 = np.linspace(-1, 1, 100)
>>> y1 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x1 / sigma)**2)
>>> s(x1, y1)
0.4418881290522094
>>> x2 = np.linspace(-2, 2, 100)
>>> y2 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x2 / sigma)**2)
>>> s(x2, y2)
0.49977093783005005
>>> x3 = np.linspace(-3, 3, 100)
>>> y3 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x3 / sigma)**2)
>>> s(x3, y3)
0.49999998748515206
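As a footnote, the same weighted mean and weighted spread can be written with np.average, which accepts a weights argument. This is just an alternative spelling of the s helper above, not a change in the math:

```python
import numpy as np

def s_avg(x, y):
    # weighted mean of x, with y acting as (unnormalized) PDF weights
    m = np.average(x, weights=y)
    # weighted mean squared deviation from m, then the square root
    return np.sqrt(np.average((x - m)**2, weights=y))

sigma = 0.5
x3 = np.linspace(-3, 3, 100)
y3 = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * (x3 / sigma)**2)
print(s_avg(x3, y3))  # ~0.5, matching s(x3, y3) above
```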