I would like to generate some random numbers which are normally distributed. It’s not mission critical, so a simple algorithm will suffice. I would then like to supply my own mean and standard deviation.
From what I have been able to read, according to the Central Limit Theorem, I should be able to approximate normally distributed random numbers by adding random numbers together.
For example:
rand()+rand()+rand()+rand()+rand()+rand()
where rand()
results in an evenly distributed random number from 0 to 1
is a reasonable approximation. (I am aware that technically it’s 0 ≤ rand() < 1
).
The expected mean is 6*0.5
so I get to the desired mean with something like this:
(rand()+rand()+rand()+rand()+rand()+rand()-3) + mean
but what would the standard deviation be?
Once I know that, would setting an arbitrary standard deviation simply be a matter of multiplying?
Update
Experimentally, I have found that
(rand()+rand()+rand()+rand()+rand()+rand()-3)*sqrt(2)*sd+mean
gives me a set of data with the desired standard deviation and mean. I have tested this out in a database (PostgreSQL) with a 10 million rows, using the stddev()
and avg()
aggregate functions, and typical results are close to within 2 decimal places which isn’t too bad.
I have no idea why sqrt(2)
is involved …
Solution
OK, thanks to Severin Pappadeux below, I have an answer.
I can generate a reasonable result with:
(rand() + … + rand() - n/2) / sqrt(n/12) * sd + mean
where n
is the number of rand()
calls I am prepared to make.
From what I have been able to read, according to the Central Limit Theorem, I should be able to approximate normally distributed random numbers by adding random numbers together.
That is a correct approach. The only problem is to carefully analyze the tails you're missing.
Let's consider making N(0,1)
- gaussian distributed with mean 0 and std.deviation of 1. Then any other gaussian N(\mu, \sigma)
is just scale and shift from N(0,1)
.
So, proposed algorithm for G(0,1)
(which is an approximation for N(0,1)
) is
G(0,1) = U(0,1) + U(0,1) + U(0,1) + U(0,1) + U(0,1) + U(0,1)
where U(0,1)
is uniformly distributed random number in the [0,1) range. Lets take a look at the mean.
E(G(0,1)) = 6*E(U(1,0)) = 6*0.5 = 3
which is exactly what you've got. So, to get 0 mean for G(0,1)
we have to subtract 3. Lets now check the variance of the G(0,1), we have to make it equal to 1.
V(G(0,1)) = 6*V(U(1,0)) = 6*(1/12) = 1/2
Std.deviation (σ) is square root of variance, so to get it to 1 you have to divide by sqrt(1/2).
So, final expression would be
G(0,1) = (U(0,1) + U(0,1) + U(0,1) + U(0,1) + U(0,1) + U(0,1) - 3)/sqrt(1/2)
and it is reasonably good approximation of the N(0,1)
.
I have no idea why sqrt(2) is involved …
Dividing by sqrt(1/2) is the same as multiplying by sqrt(2) - now I hope you know where it came from.
Some simple corollary - for some other n
sum of U(0,1) variance multiplier will include term sqrt(n/12).
Another simple corollary - because V(U(0,1)) is equal to 1/12, then summing twelve U(0,1) will not require any multipliers
G(0,1) = Sum_1^12 U(0,1) - 6
is actually often cited in some old sampling recipes books/papers.
You might also want to take a look at related Irwin-Hall distribution and Bates distribution
UPDATE
I've thought about some simplification of the approach. Suppose we want to sum even number of U(0,1)
, so n=2m
. Again, talking about G(0,1)
as an approximation for N(0,1)
G(0,1) = (Sum_1^2m U(0,1) - m ) / sqrt(m/6)
Let's rewrite it as
G(0,1) = (Sum_1^m U(0,1) - (m - Sum_1^m U(0,1)))/sqrt(m/6) =
= (Sum_1^m U(0,1) - Sum_1^m(1 - U(0,1)))/sqrt(m/6)
Due to the fact, that 1 - U(0,1)
has the same distribution as U(0,1)
we could
write G(0,1)
in symmetric form
G(0,1) = (Sum_1^m U(0,1) - Sum_1^m U(0,1))/sqrt(m/6) =
= Sum_1^m (U(0,1) - U(0,1)) / sqrt(m/6)