I am trying to understand a particular behavior of the histogram of samples generated from rnorm
.
set.seed(1)
x1 <- rnorm(1000L)
x2 <- rnorm(10000L)
x3 <- rnorm(100000L)
x4 <- rnorm(1000000L)
plot.hist <- function(vec, title, brks) {
h <- hist(vec, breaks = brks, density = 10,
col = "lightgray", main = title)
xfit <- seq(min(vec), max(vec), length = 40)
yfit <- dnorm(xfit, mean = mean(vec), sd = sd(vec))
yfit <- yfit * diff(h$mids[1:2]) * length(vec)
return(lines(xfit, yfit, col = "black", lwd = 2))
}
par(mfrow = c(2, 2))
plot.hist(x1, title = 'Sample = 1E3', brks = 100)
plot.hist(x2, title = 'Sample = 1E4', brks = 500)
plot.hist(x3, title = 'Sample = 1E5', brks = 1000)
plot.hist(x4, title = 'Sample = 1E6', brks = 1000)
You will notice that in each case (I am not making cross comparison; I know that as sample size gets larger the match between histogram and the curve is better), the histogram approximates the standard normal better towards the tails, but poorer towards the mode. Simply put, I'm trying to understand why each histogram is rougher in the middle compared to the tails. Is this an expected behavior or have I missed something basic?
This is not just true for normal samples. If we had fixed bins (rather than data-determined ones as we normally do) and we condition on the total number of observations, then the counts would be multinomial.
The expected value of the count in bin i is then n·p(i) where p(i) is the proportion of the population density that falls in bin (i).
The variance of the count in bin i would then be n·p(i)·(1-p(i)). With many bins and a smooth non-peaky density like a normal, (1-p(i)) will be very close to 1; p(i) will typically be smallish (much smaller that 1/2).
The variance of the count (and hence the standard deviation of it) is an increasing function of the expected height:
With a fixed bin width the height is proportional to the expected count and the standard deviation of the bin-height is an increasing function of the height.
So this motivates exactly what you see.
In practice it's not the case that the bin boundaries are fixed; as you add observations or generate a new sample they will change, but the number of bins changes fairly slowly as a function of the sample size (typically as the cube root, or sometimes as the log) and a more sophisticated analysis than the one here is required to get the exact form. However, the outcome is the same -- under commonly observed conditions the variance of the height of a bin typically increases monotonically with the height of the bin.