I want to create a table that counts up the number of non-NA values in each column of a dataset. I'm using summarize_all()
, but I'm struggling to get a count of non-NA values.
I referred to this StackOverflow thread to get some insight, but it didn't seem to help me obtain a count: summarize_all with "n()" function.
I'll either get a count of distinct values if I pass summarize_all(n_distinct)
or I'll get the number of rows if I pass summarize_all(list(n=~n()))
or summarize_all(list(n="length")
as the thread cites.
My desired output:
ID Female Male Non_Binary
5 5 4 3
Where am I going wrong?
# Sample Code
test<-as_tibble(data.frame(`ID` = c("1","2","3","4","5"),
`Female` = c("Female","Female","Female","Female","Female"),
`Male` = c(NA,"Male","Male","Male","Male"),
`Non_Binary`=c("Non-Binary","Non-Binary","Non-Binary",NA,NA)))
## Attempt 1
summary<-test%>%
summarize_all(list(n=~n()))
# A tibble: 1 × 4
ID_n Female_n Male_n Non_Binary_n
<int> <int> <int> <int>
1 5 5 5 5
## Attempt 2
summary<-test%>%
summarize_all(list(n="length"))
# A tibble: 1 × 4
ID_n Female_n Male_n Non_Binary_n
<int> <int> <int> <int>
1 5 5 5 5
## Attempt 3
summary<-test%>%
summarize_all(n_distinct)
# A tibble: 1 × 4
ID Female Male Non_Binary
<int> <int> <int> <int>
1 5 1 2 2
### Desired Output
ID Female Male Non_Binary
5 5 4 3
n()
and length()
are poor choices for this - they don't omit NA
values from what they count. The classic way to count values that meet conditions (such as "not NA") is to sum
the condition.
summarize_all()
has also been deprecated for a few years. Now it is preferred to use across()
.
test |>
summarize(across(everything(), \(x) sum(!is.na(x))))
# # A tibble: 1 × 4
# ID Female Male Non_Binary
# <int> <int> <int> <int>
# 1 5 5 4 3