Search code examples
rdplyrgroup-bygroup-summaries

Calculating standard errors of means by groups


What seems like a straightforward calculation of group means and standard errors doesn't work and I can't figure out why. I get NA for the std. errors

data("AddHealth", package = "heplots")
library(dplyr)

# use a function
std.error <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / sqrt(length(x))
}

# try it two ways
AddHealth |>
  group_by(grade) |>
  summarise(depression = mean(depression),
            anxiety = mean(anxiety),
            n = n(),
            se.dep = std.error(depression),
            se.anx = sd(anxiety, na.rm = TRUE) / sqrt(n)) |> 
  print()

I get NA for the standard error whether I use my function or do the calculation directly. This should be relatively simple. Why doesn't this work? How can I get what I want?

# A tibble: 6 × 6
  grade depression anxiety     n se.dep se.anx
  <ord>      <dbl>   <dbl> <int>  <dbl>  <dbl>
1 7          0.881   0.751   622     NA     NA
2 8          1.08    0.804   664     NA     NA
3 9          1.17    0.934   778     NA     NA
4 10         1.27    0.956   817     NA     NA
5 11         1.37    1.12    790     NA     NA
6 12         1.34    1.10    673     NA     NA

Solution

  • The issue is that you compute the means first and store the means using the names of the original variable. Hence, when computing the standard errors you are not computing the sd of the original variable but the sd of the mean, i.e. you are doing e.g. sd(mean(depression)) which returns NA.

    To fix your issue change the order of the computations, i.e. compute the standard errors first or use different names for the means or what I would prefer use across():

    data("AddHealth", package = "heplots")
    library(dplyr, warn=FALSE)
    
    AddHealth |>
      group_by(grade) |>
      summarise(
        n = n(),
        se.dep = sd(depression, na.rm = TRUE) / sqrt(n),
        se.anx = sd(anxiety, na.rm = TRUE) / sqrt(n),
        depression = mean(depression),
        anxiety = mean(anxiety)
      )
    #> # A tibble: 6 × 6
    #>   grade     n se.dep se.anx depression anxiety
    #>   <ord> <int>  <dbl>  <dbl>      <dbl>   <dbl>
    #> 1 7       622 0.0447 0.0420      0.881   0.751
    #> 2 8       664 0.0461 0.0411      1.08    0.804
    #> 3 9       778 0.0426 0.0387      1.17    0.934
    #> 4 10      817 0.0431 0.0388      1.27    0.956
    #> 5 11      790 0.0428 0.0411      1.37    1.12 
    #> 6 12      673 0.0439 0.0426      1.34    1.10
    
    AddHealth |>
      group_by(grade) |>
      summarise(
        n = n(),
        mean.dep = mean(depression),
        mean.anx = mean(anxiety),
        se.dep = sd(depression, na.rm = TRUE) / sqrt(n),
        se.anx = sd(anxiety, na.rm = TRUE) / sqrt(n)
      )
    #> # A tibble: 6 × 6
    #>   grade     n mean.dep mean.anx se.dep se.anx
    #>   <ord> <int>    <dbl>    <dbl>  <dbl>  <dbl>
    #> 1 7       622    0.881    0.751 0.0447 0.0420
    #> 2 8       664    1.08     0.804 0.0461 0.0411
    #> 3 9       778    1.17     0.934 0.0426 0.0387
    #> 4 10      817    1.27     0.956 0.0431 0.0388
    #> 5 11      790    1.37     1.12  0.0428 0.0411
    #> 6 12      673    1.34     1.10  0.0439 0.0426
    
    AddHealth |>
      summarise(
        n = n(),
        across(
          c(depression, anxiety),
          list(mean = mean, se = ~ sd(.x, na.rm = TRUE) / sqrt(n)),
          .names = "{.fn}.{substr(.col, 1, 3)}"
        ),
        .by = grade
      )
    #>   grade   n  mean.dep     se.dep  mean.anx     se.anx
    #> 1    11 790 1.3658228 0.04282760 1.1151899 0.04113153
    #> 2    10 817 1.2705018 0.04311396 0.9559364 0.03881506
    #> 3    12 673 1.3387816 0.04392934 1.0965825 0.04259964
    #> 4     7 622 0.8810289 0.04468291 0.7508039 0.04203107
    #> 5     8 664 1.0753012 0.04611600 0.8042169 0.04106970
    #> 6     9 778 1.1748072 0.04258433 0.9344473 0.03870629