I am trying to create a descriptive statistics table, obtaining for each column of a data frame the following statistics: mean, sd, 10th, 50th, and 90th quantiles. I then want to transpose the dataset so that the columns are the different statistics, and each row is a variable in the dataset.
Here is a sample dataset:
dt <- data.frame(id = 1:100,
Numeric_Column_1 = rnorm(100),
Numeric_Column_2 = rnorm(100),
Numeric_Column_3 = rnorm(100),
Numeric_Column_4 = rnorm(100),
Numeric_Column_5 = rnorm(100))
and the code that should generate the table:
desc_table <- dt %>% select(-id) %>%
dplyr::summarise_all(.funs = list(mean=mean(.,na.rm=T),
sd=sd(.,na.rm=T),
P10=~quantile(., c(0.1), na.rm=T),
P50=~quantile(., c(0.5), na.rm=T),
P90=~quantile(., c(0.9), na.rm=T)),
na.rm=TRUE) %>%
pivot_longer(cols = everything()) %>%
separate(name,c("Variable", "Stat"),sep = "_") %>%
pivot_wider(names_from = "Stat", values_from = "value") %>%
mutate(mean = round(mean, 2), sd= round(sd, 2))
However I get the following error:
Error in is.data.frame(x): 'list' object cannot be coerced to type 'double' In addition: Warning message: In mean.default (., na.rm = T): argument is not numeric or logical: returning NA
How can I fix this?
Try this, modifying your code to fit the modern idiom and changing the separator in your <colname><separator><statistic>
idiom from "_"
to "."
to avoid a clash with your column names (which might have been the source of your error)...
dt %>%
dplyr::summarise(
across(
-id,
list(
mean = \(x) mean(x, na.rm = TRUE),
sd = \(x) sd(x, na.rm = TRUE),
P10 = \(x) quantile(x, 0.1, na.rm = TRUE),
P50 = \(x) quantile(x, 0.5, na.rm = TRUE),
P90 = \(x) quantile(x, 0.9, na.rm = TRUE)
),
.names = "{.col}.{.fn}"
)
) %>%
pivot_longer(
everything(),
names_sep = "\\.",
names_to = c("Variable", "Stat")
) %>%
pivot_wider(names_from = "Stat", values_from = "value") %>%
mutate(mean = round(mean, 2), sd= round(sd, 2))
# A tibble: 5 × 6
Variable mean sd P10 P50 P90
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Numeric_Column_1 -0.04 0.94 -1.20 -0.0872 1.11
2 Numeric_Column_2 -0.15 1.03 -1.46 -0.107 1.07
3 Numeric_Column_3 0.11 1.01 -1.53 0.229 1.14
4 Numeric_Column_4 0.09 1.05 -1.17 0.103 1.53
5 Numeric_Column_5 -0.02 1.02 -1.34 -0.0238 1.38
The use of .names
in the across
call removes the need for your separate
step.
It might be better in the long run to drop the last element of the pipe and replace it with knitr::kable(digits = 2)
. This maintains the internal accuracy of your summary whilst formatting it as you request for display.
Also, see this page for an explanation of why you should use TRUE
and FALSE
rather than T
and F
.