What is the best way to calculate summary statistics for a matrix with factors (using R)?
Sample data:
df <- read.table(text="
Gender Obs1 Obs2 Obs3
1 F 0.0 -1.25 -1.23
2 M -0.5 0.83 -0.25
3 UNK 1.0 1.22 0.24
4 F 1.0 0.25 0.76
5 F 0.5 1.02 0.0
6 M 2.0 -0.05 0.66
7 M 1.0 5.80 NA
8 M 1.0 NA 1.0
9 UNK 0.0 NA 2.2", header=T)
df <- as.data.frame(df)
Expected output:
Gender | Obs1_count | Obs2_count | Obs3_count | Obs1_mean | Obs2_mean | Obs3_mean
---------------------------------------------------------------------------------
F 3 3 3 0.5 0.00667 -0.157
M 4 3 3 0.875 2.0433 0.47
UNK 2 1 2 0.5 1.22 1.22
What I've tried:
library(dplyr)
summaryStats <- df %>% group_by(Gender) %>% na.omit() %>%
summarise_all(.funs = funs(
count = length(.),
mean = mean(.)))
We can use sum
on a logical vector to get the 'count' of non-NA elements, the mean
also have an argument na.rm = TRUE
to remove those NA elements. So, we don't need na.omit()
. Though, it does remove NA
elements, it also remove a complete row whenever there is only a single NA
on that row. In effect, the 'count' gets reduced for some columns and 'mean' also changes based on the removal of rows with non-NA values
df %>%
group_by(Gender) %>%
summarise_all(.funs = funs(count = sum(!is.na(.)),
mean = mean(., na.rm = TRUE)))
# A tibble: 3 x 7
# Gender Obs1_count Obs2_count Obs3_count Obs1_mean Obs2_mean Obs3_mean
# <fct> <int> <int> <int> <dbl> <dbl> <dbl>
#1 F 3 3 3 0.5 0.00667 -0.157
#2 M 4 3 3 0.875 2.19 0.47
#3 UNK 2 1 2 0.5 1.22 1.22