I'm sampling from a lognormal distribution in R. When I look at the mean and standard deviation of the resulting samples, I notice that the sample standard deviation is consistently lower than the true population standard deviation. The same does not appear to happen with the means.
Is there a bias in the simulation sample statistics that I'm forgetting? Even if so, it seems that this bias is larger than I would have expected.
What I'm working with in R:
library(dplyr)   ## Cleaning data
library(tidyr)   ## Tidying data
library(stringi) ## String manipulation
## Define simulation controls
n_sample <- 10
sample_size <- 1000
mu <- 10
sigma <- 3
## Lognormal mean and standard deviation
true_mean <- exp(mu + sigma ^ 2 / 2)
true_sd <- sqrt((exp(sigma ^ 2) - 1) * exp(2 * mu + sigma ^ 2))
## For reproducibility
set.seed(42)
sample_id <- stri_rand_strings(n_sample, length = 5)
counts <- rep(sample_size, n_sample)
observations <- lapply(counts, rlnorm, meanlog = mu, sdlog = sigma)
names(observations) <- sample_id
## Summarize results of the n_sample-many simulations
obs_table <- observations %>%
  bind_rows() %>%
  gather(key = "sample",
         value = "obs") %>%
  group_by(sample) %>%
  summarize(mean = mean(obs),
            sd = sd(obs)) %>%
  ## Mean departure and SD departure from the true
  ## underlying distribution.
  mutate(mean_dep = mean / true_mean - 1,
         sd_dep = sd / true_sd - 1)
obs_table
Observe your true_sd value:
> true_sd
[1] 178471287
That value is enormous. The problem here is that your sample size (1000) is too small relative to the variance of the distribution, so your sample statistics are not good estimates of the population mean/SD. The "bias" you observe (i.e. underestimating the standard deviation most of the time) likely comes from the skewness and kurtosis of the distribution, but again, it shrinks as the sample size increases.
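To see just how extreme this distribution is, you can plug sigma into the standard closed-form lognormal formulas for the coefficient of variation and skewness (both depend only on sigma, not mu):

## Closed-form lognormal summaries for the question's sigma = 3
sigma <- 3
cv <- sqrt(exp(sigma ^ 2) - 1)                          ## SD / mean, approx. 90
skew <- (exp(sigma ^ 2) + 2) * sqrt(exp(sigma ^ 2) - 1) ## approx. 7.3e5
c(cv = cv, skew = skew)

With a standard deviation about 90 times the mean and skewness on the order of 10^5, most of the variance sits in rare, very large draws that a sample of 1000 will usually not contain, so sd(obs) tends to come out low.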
Hint: Try tweaking your sample size and parameters (mu and sigma) and check how your sample statistics relate to the "real" mean and standard deviation.
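For example, here is a minimal sketch along those lines (the sample sizes and the rounding are my own illustrative choices, not from the question):

## Relative departure of the sample SD from the true SD as n grows
mu <- 10
sigma <- 3
true_sd <- sqrt((exp(sigma ^ 2) - 1) * exp(2 * mu + sigma ^ 2))
set.seed(42)
sizes <- c(1e3, 1e4, 1e5, 1e6)
sd_dep <- sapply(sizes, function(n) sd(rlnorm(n, meanlog = mu, sdlog = sigma)) / true_sd - 1)
data.frame(size = sizes, sd_dep = round(sd_dep, 2))

Even at 10^6 draws the sample SD will often still sit well below the true value when sigma = 3, whereas with a milder sigma (say 1) the estimates at n = 1000 already track the truth much more closely.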