I would like to create a table of mean and sd for multiple variables for grouped data. However, the data has NA
s, so I need to include the na.rm =T
command.
Using iris
as a MWE, altered to include NA
s:
irisalt = iris
irisalt[1,1] =NA
irisalt[52,2] =NA
irisalt[103,3]= NA
First attempt:
irisalt%>%
group_by(Species)%>%
summarise(count = n(),
across(contains("."), c("mean" = mean, "sd" = sd))
)
Species count Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd Petal.Length_mean Petal.Length_sd
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 50 NA NA 3.43 0.379 1.46 0.174
2 versicolor 50 5.94 0.516 NA NA 4.26 0.470
3 virginica 50 6.59 0.636 2.97 0.322 NA NA
This is the table I need, but I want to calculate the means and sd by removing the NA.
Second attempt:
irisalt%>%
group_by(Species)%>%
drop_na()%>%
summarise(count = n(),
across(contains("."), c("mean" = mean, "sd" = sd))
)
This removes the whole rows where there are NA, and so alters the means for variables where there is data.
3rd attempt:
irisalt%>%
group_by(Species)%>%
summarise(count = n(),
across(contains("."), c("mean" = mean(., na.rm = T), "sd" = sd(., na.rm =T)))
)
Error in `summarise()`:
i In argument: `across(...)`.
Caused by error in `is.data.frame()`:
! 'list' object cannot be coerced to type 'double'
Fourth attempt:
irisalt%>%
group_by(Species)%>%
summarise(count = n(),
across(contains("."), ~c("mean" = mean(., na.rm = T), "sd" = sd(., na.rm =T)))
)
Species count Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <int> <dbl> <dbl> <dbl> <dbl>
1 setosa 50 5.00 3.43 1.46 0.246
2 setosa 50 0.356 0.379 0.174 0.105
3 versicolor 50 5.94 2.76 4.26 1.33
4 versicolor 50 0.516 0.311 0.470 0.198
5 virginica 50 6.59 2.97 5.54 2.03
6 virginica 50 0.636 0.322 0.555 0.275
Warning message:
Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
i Please use `reframe()` instead.
i When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an ungrouped data frame and adjust
accordingly.
These are the numbers I need, but I need a single row per group (Species
), with a separate column for mean and sd for each variable, like in my first attempt
Looks like you are close, but your syntax is a bit off:
library(dplyr)
irisalt %>%
group_by(Species) %>%
summarise(count = n(),
across(contains("."), list(mean = ~ mean(., na.rm = T),
sd = ~ sd(., na.rm =T)))
)
From the documentation ?across
A named list of functions or lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)). Each function is applied to each column, and the output is named by combining the function name and the column name using the glue specification in .names.
Note: as of dplyr 1.1.0 summarize
now has a .by
argument that is experimental but allows for one-time group computations like you have here. So you do not need to pipe to group_by
.
Output
Species count Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean Sepal.Width_sd Petal.Length_mean Petal.Length_sd Petal.Width_mean Petal.Width_sd
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 50 5.00 0.356 3.43 0.379 1.46 0.174 0.246 0.105
2 versicolor 50 5.94 0.516 2.76 0.311 4.26 0.470 1.33 0.198
3 virginica 50 6.59 0.636 2.97 0.322 5.54 0.555 2.03 0.275