Search code examples
rdplyrsummarize

Mean and sd per group for multiple variables when NAs present


I would like to create a table of mean and sd for multiple variables for grouped data. However, the data has NAs, so I need to include the na.rm =T command.

Using iris as a MWE, altered to include NAs:

irisalt = iris
irisalt[1,1] =NA
irisalt[52,2] =NA
irisalt[103,3]= NA

First attempt:

irisalt%>%
  group_by(Species)%>%
  summarise(count = n(),
            across(contains("."), c("mean" = mean, "sd" = sd))
  )

 Species    count Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd Petal.Length_mean Petal.Length_sd
  <fct>      <int>             <dbl>           <dbl>            <dbl>          <dbl>             <dbl>           <dbl>
1 setosa        50             NA             NA                 3.43          0.379              1.46           0.174
2 versicolor    50              5.94           0.516            NA            NA                  4.26           0.470
3 virginica     50              6.59           0.636             2.97          0.322             NA             NA    

This is the table I need, but I want to calculate the means and sd by removing the NA.

Second attempt:

irisalt%>%
  group_by(Species)%>%
  drop_na()%>%
  summarise(count = n(),
            across(contains("."), c("mean" = mean, "sd" = sd))
  )

This removes the whole rows where there are NA, and so alters the means for variables where there is data.

3rd attempt:

irisalt%>%
  group_by(Species)%>%
  summarise(count = n(),
            across(contains("."), c("mean" = mean(., na.rm = T), "sd" = sd(., na.rm =T)))
  )

Error in `summarise()`:
i In argument: `across(...)`.
Caused by error in `is.data.frame()`:
! 'list' object cannot be coerced to type 'double'

Fourth attempt:

irisalt%>%
  group_by(Species)%>%
  summarise(count = n(),
            across(contains("."), ~c("mean" = mean(., na.rm = T), "sd" = sd(., na.rm =T)))
  )

Species    count Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>      <int>        <dbl>       <dbl>        <dbl>       <dbl>
1 setosa        50        5.00        3.43         1.46        0.246
2 setosa        50        0.356       0.379        0.174       0.105
3 versicolor    50        5.94        2.76         4.26        1.33 
4 versicolor    50        0.516       0.311        0.470       0.198
5 virginica     50        6.59        2.97         5.54        2.03 
6 virginica     50        0.636       0.322        0.555       0.275
Warning message:
Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
i Please use `reframe()` instead.
i When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an ungrouped data frame and adjust
  accordingly.

These are the numbers I need, but I need a single row per group (Species), with a separate column for mean and sd for each variable, like in my first attempt


Solution

  • Looks like you are close, but your syntax is a bit off:

    library(dplyr)
    
    irisalt %>%
      group_by(Species) %>%
      summarise(count = n(),
                across(contains("."), list(mean = ~ mean(., na.rm = T), 
                                           sd = ~ sd(., na.rm =T)))
      )
    

    From the documentation ?across

    A named list of functions or lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)). Each function is applied to each column, and the output is named by combining the function name and the column name using the glue specification in .names.

    Note: as of dplyr 1.1.0 summarize now has a .by argument that is experimental but allows for one-time group computations like you have here. So you do not need to pipe to group_by.

    Output

     Species    count Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd Petal.Length_mean Petal.Length_sd Petal.Width_mean Petal.Width_sd
      <fct>      <int>             <dbl>           <dbl>            <dbl>          <dbl>             <dbl>           <dbl>            <dbl>          <dbl>
    1 setosa        50              5.00           0.356             3.43          0.379              1.46           0.174            0.246          0.105
    2 versicolor    50              5.94           0.516             2.76          0.311              4.26           0.470            1.33           0.198
    3 virginica     50              6.59           0.636             2.97          0.322              5.54           0.555            2.03           0.275