
Create a sampling distribution using randomly drawn indices in another column in Python


I'm trying to solve this problem:

Create a sampling distribution for our 30-observation-mean estimator of the mean of C. To do so, randomly draw 1000 sets of 30 observations from C, using the randomly drawn indices in D (the first row are first 30 randomly-drawn indexes, the second row are the second 30 randomly-drawn indexes, etc). For each random draw, compute the mean. Then plot the histogram of the distribution. Compare the distribution to np.mean(C).

Where C is

array([23, 23, 23, ..., 68, 34, 42])

C has 100030 elements, and D (30000 entries in total, i.e. 1000 rows of 30 indices) is

array([[23989, 10991, 81533, ..., 75050, 13817, 47678],
       [54864, 54830, 89396, ..., 22709, 14556, 62298],
       [ 2936, 28729,  4404, ..., 21431, 81187, 49178],
       ...,
       [30737, 12974, 41031, ..., 43003, 61132, 33385],
       [64713, 53207, 49529, ..., 72596, 76406, 15207],
       [29503, 71648, 27210, ..., 31298, 47102, 13024]])

I'm trying to understand what the problem is asking and how to solve it. So far I have initialized an array of zeros and tried to fill it with means computed from the indices in D, but I'm not sure whether this is what is actually being asked for. Any help?

import numpy as np

samp = np.zeros(1000)
for i in np.arange(1000):
    # mean of the 30 values of C selected by row i of D
    samp[i] = np.mean(C[D[i]])

Also, the following takes random samples from C, but I'm not sure how to use the indices in D with it:

means_size_30 = []
for x in range(1000):
    mean = np.random.choice(C, size = 30).mean()
    means_size_30.append(mean)
means_size_30 = np.array(means_size_30)
plt.hist(means_size_30);

Solution

  • You can directly access the values of C by using the indexes provided in D. If you use the 2-dimensional array D to access values of the 1-dimensional array C, the resulting array will have the same shape as D: 2-dimensional. It will have 1000 rows with each row having 30 samples from C.

    In the next step you just have to calculate the mean over each row (set axis=1):

    means_size_30 = C[D].mean(axis=1)
    plt.hist(means_size_30)
    plt.axvline(np.mean(C))
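
To check the indexing behaviour described above without the original data, here is a minimal sketch. The arrays C and D are synthetic stand-ins (the value range, seed, and random generator are assumptions, not the question's data). It shows that C[D] has the shape of D, that the vectorised row means match an explicit loop, and how the centre and spread of the sample means could be compared with the population.

```python
import numpy as np

# Synthetic stand-ins for the question's data (assumed, not the real C and D).
rng = np.random.default_rng(0)
C = rng.integers(20, 70, size=100030)          # 1-D "population"
D = rng.integers(0, C.size, size=(1000, 30))   # 1000 rows of 30 indices into C

# Indexing a 1-D array with a 2-D integer array gives an array shaped like D.
samples = C[D]
means_size_30 = samples.mean(axis=1)           # one mean per row of D

# The same means, computed row by row with an explicit loop.
loop_means = np.array([C[D[i]].mean() for i in range(D.shape[0])])

print(samples.shape)                           # (1000, 30)
print(np.allclose(means_size_30, loop_means))  # True

# Centre of the sampling distribution versus the population mean.
print(means_size_30.mean(), C.mean())
# Spread of the sample means versus the standard error sigma / sqrt(n).
print(means_size_30.std(), C.std() / np.sqrt(30))
```

With the real C and D, the histogram and the vertical line at np.mean(C) from the answer can be drawn from means_size_30 in the same way. The sample means should cluster around np.mean(C), with a spread close to np.std(C) / np.sqrt(30).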