Search code examples
rggplot2histogramnormal-distribution

fit a normal distribution to grouped data, giving expected frequencies


I have a frequency distribution of observations, grouped into counts within class intervals. I want to fit a normal (or other continuous) distribution, and find the expected frequencies in each interval according to that distribution.

For example, suppose the following, where I want to calculate another column, expected giving the expected number of soldiers with chest circumferences in the interval given by chest, where these are assumed to be centered on the nominal value. E.g., 35 = 34.5 <= y < 35.5. One analysis I've seen gives the expected frequency in this cell as 72.5 vs. the observed 81.

> data(ChestSizes, package="HistData")
> 
> ChestSizes
   chest count
1     33     3
2     34    18
3     35    81
4     36   185
5     37   420
6     38   749
7     39  1073
8     40  1079
9     41   934
10    42   658
11    43   370
12    44    92
13    45    50
14    46    21
15    47     4
16    48     1
> 

> # ungroup to a vector of values
> chests <- vcdExtra::expand.dft(ChestSizes, freq="count")

There are quite a number of variations of this question, most of which relate to plotting the normal density on top of a histogram, scaled to represent counts not density. But none explicitly show the calculation of the expected frequencies. One close question is R: add normal fits to grouped histograms in ggplot2

I can perfectly well do the standard plot (below), but for other things, like a Chi-square test or a vcd::rootogram plot, I need the expected frequencies in the same class intervals.

> bw <- 1
n_obs <- nrow(chests)
xbar <- mean(chests$chest)
std <- sd(chests$chest)

plt <-
ggplot(chests, aes(chest))  + 
  geom_histogram(color="black", fill="lightblue",  binwidth = bw) + 
  stat_function(fun = function(x) 
    dnorm(x, mean = xbar, sd = std) * bw * n_obs,
    color = "darkred", size = 1)

plt

enter image description here


Solution

  • here is how you could calculate the expected frequencies for each group assuming Normality.

    xbar <- with(ChestSizes, weighted.mean(chest, count))
    sdx <- with(ChestSizes, sd(rep(chest, count)))
    transform(ChestSizes, Expected = diff(pnorm(c(32, chest) + .5, xbar, sdx)) * sum(count))
    
       chest count     Expected
    1     33     3    4.7600583
    2     34    18   20.8822328
    3     35    81   72.5129162
    4     36   185  199.3338028
    5     37   420  433.8292832
    6     38   749  747.5926687
    7     39  1073 1020.1058521
    8     40  1079 1102.2356155
    9     41   934  943.0970605
    10    42   658  638.9745241
    11    43   370  342.7971793
    12    44    92  145.6089948
    13    45    50   48.9662992
    14    46    21   13.0351612
    15    47     4    2.7465640
    16    48     1    0.4579888