I'm trying to calculate the mean, sd and coefficient of variation (cv) with tidyverse functions. The mean and sd work fine, but the cv, computed as 100*(sd/mean), fails with the following error:
Error in `dplyr::summarise()`:
! Problem while computing `..3 = across(where(is.numeric), (sd/mean) * 100, na.rm =
TRUE, .names = "{col}_cv")`.
ℹ The error occurred in group 1: group = 1.
Caused by error in `sd / mean`:
! non-numeric argument to binary operator
Run `rlang::last_error()` to see where the error occurred.
Take the following df as an example (it is not the real data, but the problem is the same):
df <- data.frame(group = as.factor(c(1,1,1,1,2,2,2,2)),
v1 = as.numeric(c(5,7,8,6,22,24,26,24)),
v2 = as.numeric(c(5,7,8,6,22,24,26,24)))
If I calculate only the mean and sd, there's no problem (except that I get two extra, empty columns called v1_mean_sd and v2_mean_sd, but that's not my main problem):
df2 <- df %>%
group_by(group) %>% dplyr::summarise(
across(where(is.numeric), mean, na.rm = TRUE, .names = "{col}_mean"),
across(where(is.numeric), sd, na.rm = TRUE, .names = "{col}_sd"))
Then, when I add the cv calculation to the code, I get the error shown above:
df2 <- df %>%
group_by(group) %>% dplyr::summarise(
across(where(is.numeric), mean, na.rm = TRUE, .names = "{col}_mean"),
across(where(is.numeric), sd, na.rm = TRUE, .names = "{col}_sd"),
across(where(is.numeric), (sd/mean)*100, na.rm = TRUE, .names = "{col}_cv")) # Here the cv
I would expect to get the cv result just as I did for the mean and the sd.
Any suggestions?
Thanks in advance
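The function argument of across() must be a function or a lambda, but (sd/mean)*100 is an ordinary expression: R evaluates it immediately and tries to divide one function object by another, hence "non-numeric argument to binary operator". For a compound calculation like this, write a lambda expression instead of passing a bare function plus extra arguments. The tidyverse now recommends a lambda even for a single function (passing extra arguments such as na.rm through ... in across() is deprecated as of dplyr 1.1.0):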
library(dplyr)
df %>%
group_by(group) %>% dplyr::summarise(
across(where(is.numeric), mean, na.rm = TRUE, .names = "{col}_mean"),
across(where(is.numeric), sd, na.rm = TRUE, .names = "{col}_sd"),
across(where(is.numeric), ~
(sd(.x, na.rm = TRUE)/mean(.x, na.rm = TRUE))*100, .names = "{col}_cv"))
Output:
# A tibble: 2 × 15
group v1_mean v2_mean v1_sd v2_sd v1_mean_sd v2_mean_sd v1_cv v2_cv v1_mean_cv v2_mean_cv v1_sd_cv v2_sd_cv v1_mean_sd_cv v2_mean_s…¹
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6.5 6.5 1.29 1.29 NA NA 19.9 19.9 NA NA NA NA NA NA
2 2 24 24 1.63 1.63 NA NA 6.80 6.80 NA NA NA NA NA NA
# … with abbreviated variable name ¹v2_mean_sd_cv
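To see why the original call failed, it may help to reproduce the error outside of dplyr. The sketch below uses only base R; the vector x and the helper name cv are made up for illustration and are not part of the original question. It suggests that the problem is the expression sd/mean itself, not across(), and that wrapping the calculation in a function (which is what the lambda above does) avoids it.

```r
# Illustrative data: the values of v1 in group 1
x <- c(5, 7, 8, 6)

# sd and mean are function objects, so dividing them fails before
# across() ever gets a chance to call anything
res <- tryCatch(sd / mean, error = function(e) conditionMessage(e))
print(res)

# Hypothetical helper: the same calculation wrapped in a function
cv <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm) * 100
}

print(cv(x))  # about 19.9, matching v1_cv for group 1
```

A named helper like this could also be passed to across() directly in place of the lambda, assuming the default na.rm = TRUE is what you want.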
NOTE: Some additional columns are created because each across() that selects with where(is.numeric) picks up not only the numeric columns of the original data but also the new columns created by the earlier across() steps. This can be avoided by replacing where(is.numeric) in the second and third across() with the specific column names, or by doing everything in a single across():
df %>%
group_by(group) %>%
dplyr::summarise(across(where(is.numeric),
list(mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE),
cv = ~ sd(.x, na.rm = TRUE)/mean(.x, na.rm = TRUE) * 100)))
Output:
# A tibble: 2 × 7
group v1_mean v1_sd v1_cv v2_mean v2_sd v2_cv
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6.5 1.29 19.9 6.5 1.29 19.9
2 2 24 1.63 6.80 24 1.63 6.80
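The other option mentioned in the note, naming the columns explicitly in each across(), might look like the sketch below. It assumes the example df from the question and that the columns of interest are v1 and v2; with real data you would substitute your own column names or a tidyselect helper. The .names pattern uses the documented {.col} placeholder.

```r
library(dplyr)

# Example data from the question
df <- data.frame(group = as.factor(c(1, 1, 1, 1, 2, 2, 2, 2)),
                 v1 = c(5, 7, 8, 6, 22, 24, 26, 24),
                 v2 = c(5, 7, 8, 6, 22, 24, 26, 24))

# Each across() selects only v1 and v2, so the summary columns created
# by earlier across() calls are not summarised again
res <- df %>%
  group_by(group) %>%
  summarise(
    across(c(v1, v2), ~ mean(.x, na.rm = TRUE), .names = "{.col}_mean"),
    across(c(v1, v2), ~ sd(.x, na.rm = TRUE), .names = "{.col}_sd"),
    across(c(v1, v2), ~ sd(.x, na.rm = TRUE) / mean(.x, na.rm = TRUE) * 100,
           .names = "{.col}_cv"))

print(res)
```

This should give the same seven columns as the single across() version, only ordered by statistic (all means, then all sds, then all cvs) rather than by variable.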