I am trying to answer this question:
Assume that a sample is created from a standard normal distribution (μ = 0, σ = 1). Take sample lengths ranging from N = 1 to 400. For each sample length, draw 1000 samples and estimate the mean from each of the samples. Find the standard deviation from these means, and show that the standard deviation corresponds to a square root reduction.
I'm not sure if I am interpreting the question properly, but my goal is to find the standard deviation of the means for each sample length and then show that the decrease in standard deviation is similar to a square root reduction.
This is what I have so far (does what I'm doing make sense in relation to the problem?):
First, I create the normal distribution and plot a simple one for reference:
import math
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
from scipy.stats import norm, kurtosis, skew
from scipy import stats
n = np.arange(1,401,1)
mu = 0
sigma = 1
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 100)
pdf = stats.norm.pdf(x, mu, sigma)
# plot normal distribution
plt.plot(x,pdf)
plt.show()
Now for the sample lengths, and calculating the standard deviation and mean:
sample_means = []
sample_stdevs = []
for i in range(400):
    rand_list = np.random.randint(1,400,1000) #samples ranging from values 1 - 400, and make a 1000 of them
    sample_means.append(np.mean(rand_list))
    sample_stdevs.append(np.std(sample_means))
plt.plot(sample_stdevs)
Does this make sense? I am also confused about the square root reduction part.
Take sample lengths ranging from N = 1 to 400. For each sample length, draw 1000 samples and estimate the mean from each of the samples.
A sample of length 200 means drawing 200 sample points. Take its mean. Now do this 1000 times for N = 200 and you have 1000 means. Calculate the std of these 1000 means; it tells you how widely the means are spread. Do this for all N to see how this spread changes for different sample lengths.
The idea is that if you only draw 5 points, it's quite likely their mean won't sit nicely near 0. If you collect 1000 of these means, they will vary widely and you'll get a large spread. If you draw a larger sample, the law of large numbers says its mean will be very close to 0, and this will be reproducible even if you do it 1000 times. Therefore the spread of those means will be smaller.
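To make that intuition concrete, here is a minimal sketch comparing the two sample lengths mentioned above (5 and 200). The helper name `std_of_means` and the fixed seed are my own choices for illustration, not part of the exercise.

```python
import numpy as np

# Seeded generator, used only so the sketch is reproducible (an assumption).
rng = np.random.default_rng(seed=0)

def std_of_means(length, n_samples=1000):
    """Draw n_samples samples of the given length from N(0, 1) and
    return the standard deviation of their means."""
    samples = rng.normal(size=(n_samples, length))  # one sample per row
    return samples.mean(axis=1).std()

small = std_of_means(5)    # short samples: the means scatter widely
large = std_of_means(200)  # longer samples: the means cluster near 0

print(f"N = 5:   std of means = {small:.3f}")
print(f"N = 200: std of means = {large:.3f}")
```

The first number should come out noticeably larger than the second, which is the shrinking spread described above.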
The standard deviation of the mean is the standard deviation of the population (σ = 1 in our case) divided by the square root of the size of the sample we drew. See the wiki article for a derivation.
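The derivation is short enough to sketch here. It assumes the N draws X_1, ..., X_N are independent and share the variance σ²; the symbol \bar{X}_N for the sample mean is notation I am introducing for this sketch.

```latex
\[
\operatorname{Var}(\bar{X}_N)
  = \operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right)
  = \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(X_i)
  = \frac{\sigma^2}{N},
\qquad\text{so}\qquad
\operatorname{sd}(\bar{X}_N) = \frac{\sigma}{\sqrt{N}}.
\]
```

With σ = 1 this gives 1/√N, which is the "square root reduction" the exercise asks for.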
import numpy as np
import matplotlib.pyplot as plt
stdevs = []
lengths = np.arange(1, 401)
for length in lengths:
    # mean = 0, std = 1 by default
    sample = np.random.normal(size=(length, 1000))
    stdevs.append(sample.mean(axis=0).std())
plt.plot(lengths, stdevs)
plt.plot(lengths, 1 / np.sqrt(lengths))
plt.legend(['Sampling', 'Theory'])
plt.show()
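Output (plot): standard deviation of the sample means versus sample length, with the 'Sampling' and 'Theory' curves.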
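If you want a number rather than a visual comparison, one possible check is to fit a straight line to the data on log-log axes: a 1/√N law should give a slope close to -0.5. This is only a sketch; the seeded generator and the use of `np.polyfit` are my own additions, not something the exercise requires.

```python
import numpy as np

# Seeded generator, used only so the sketch is reproducible (an assumption).
rng = np.random.default_rng(seed=0)

lengths = np.arange(1, 401)
# For each length: 1000 samples (columns), mean of each, then the std of those means.
stdevs = np.array([
    rng.normal(size=(length, 1000)).mean(axis=0).std()
    for length in lengths
])

# Least-squares fit of log(std) = slope * log(N) + intercept.
slope, intercept = np.polyfit(np.log(lengths), np.log(stdevs), deg=1)

print(f"fitted exponent:  {slope:.3f}")              # theory predicts -0.5
print(f"fitted prefactor: {np.exp(intercept):.3f}")  # theory predicts sigma = 1
```

The same data plotted with `plt.loglog(lengths, stdevs)` should look like a straight line, which is another way to see the power law.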