Search code examples
rstatisticssimulationdistributionsample

How to simulate the sampling distribution?


I'm trying to gain a deeper understanding of the sampling distribution, and I've been working through some simulations to that end. For this exercise, the distribution I'm working with is a log-normal distribution with mean=0.1 and sigma=0.17. My code is below:

n_sims <- 1000

mu <- rep(NA, n_sims)
lo95 <- rep(NA, n_sims)
hi95 <- rep(NA, n_sims)

data <- rlnorm(1000, 0.1, 0.17)

for (i in 1:n_sims){
  sim <- sample(data, 1000)
  mu[i] <- mean(sim)
  lo95[i] <- mean(sim) - 2*sd(sim)
  hi95[i] <- mean(sim) + 2*sd(sim)
}

xs <- seq(1,n_sims,1)

plot(xs, mu, pch=16, ylim = c(min(lo95)-0.05, max(hi95)+0.05))
segments(xs, lo95, xs, hi95, lwd = 0.5, col = "gray")

sum((lo95 <= 1.1) & (hi95 >= 1.1))

I'm expecting 95% of the samples to contain the true value of the distribution (1.1 on the transformed scale), but the last line of code reveals that all of the 1000 samples contain the true mean? My understanding is that only 95% of these simulations should contain the correct mean. Is there something I'm not understanding?


Solution

  • The bug is located here: sample(data, 1000).
    The default for the sample function is "replace=FALSE" thus every iteration is using the same exact samples. To properly bootstrap your analysis you need to sample with replacement: sim <- sample(data, 1000, replace=TRUE).

    Also to calculate the confidence limits of your estimated mean, I believe you want to use mu +/- 2*sd/sqrt(n), where n is the number of samples.