I have a dataframe df as below:
id na_count task q1 q2 q3 q4 q5
7 3 a 1 NA NA 2 NA
7 1 b 1 0 0 NA 0
7 3 c NA NA 1 NA 1
9 0 a 1 1 0 2 1
9 1 b 1 0 0 1 NA
9 0 c 1 1 0 1 0
9 1 d 1 0 NA 1 1
3 3 a 1 NA NA 1 NA
3 1 b 1 1 NA 2 1
1 2 b 1 1 NA 1 NA
1 2 c 1 1 NA 1 NA
1 3 d NA NA 1 NA 1
2 4 a 1 NA NA NA NA
2 2 b 1 2 NA 1 NA
2 1 c 1 1 2 NA 2
2 1 d NA 1 3 3 3
2 0 e 2 2 3 3 4
I am interested in adding a binary column or flag, evidence, which is computed per id by checking whether that id meets a minimum threshold of non-NA values.
For example, with my minimum non-NA threshold set to 10, if an id has at least 10 non-NA values (summed over all of its rows), I want to set evidence to yes; otherwise I want to set evidence to no.
(Preferred) If possible, I want to derive the non-NA count from the NA counts already stored in the column na_count rather than recomputing the NAs over the columns q1:q5.
For the example with the threshold of 10 non-NA, my output would be as below:
id na_count task q1 q2 q3 q4 q5 evidence
7 3 a 1 NA NA 2 NA no
7 1 b 1 0 0 NA 0 no
7 3 c NA NA 1 NA 1 no
9 0 a 1 1 0 2 1 yes
9 1 b 1 0 0 1 NA yes
9 0 c 1 1 0 1 0 yes
9 1 d 1 0 NA 1 1 yes
3 3 a 1 NA NA 1 NA no
3 1 b 1 1 NA 2 1 no
1 2 b 1 1 NA 1 NA no
1 2 c 1 1 NA 1 NA no
1 3 d NA NA 1 NA 1 no
2 4 a 1 NA NA NA NA yes
2 2 b 1 2 NA 1 NA yes
2 1 c 1 1 2 NA 2 yes
2 1 d NA 1 3 3 3 yes
2 0 e 2 2 3 3 4 yes
I have tried the following, but it just counts the rows, not the non-NA values over multiple rows for that id.
library(dplyr)
df = df %>%
  group_by(id) %>%
  mutate(rows = n())
The following posts are related but do not address my problem: "How to make n() do not count NAs too in tidyverse?", "Taking a count() after group_by() for non-missing values", and "Count number of non-NA values by group".
To make this reproducible, here is the dput() output of the dataframe:
# dput(df)
structure(list(
id = c(7L, 7L, 7L, 9L, 9L, 9L, 9L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
na_count = c(3L, 1L, 3L, 0L, 1L, 0L, 1L, 3L, 1L, 2L, 2L, 3L, 4L, 2L, 1L, 1L, 0L),
task = c("a", "b", "c", "a", "b", "c", "d", "a", "b", "b", "c", "d", "a", "b", "c", "d", "e"),
q1 = c(1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L, NA, 2L),
q2 = c(NA, 0L, NA, 1L, 0L, 1L, 0L, NA, 1L, 1L, 1L, NA, NA, 2L, 1L, 1L, 2L),
q3 = c(NA, 0L, 1L, 0L, 0L, 0L, NA, NA, NA, NA, NA, 1L, NA, NA, 2L, 3L, 3L),
q4 = c(2L, NA, NA, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, NA, NA, 1L, NA, 3L, 3L),
q5 = c(NA, 0L, 1L, 1L, NA, 0L, 1L, NA, 1L, NA, NA, 1L, NA, NA, 2L, 3L, 4L)),
row.names = c(NA, -17L), class = "data.frame")
Any help on this would be greatly appreciated, thanks!
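library(tidyverse)
threshold = 10  # minimum number of non-NA values required per id
df %>%
  group_by(id) %>%
  # total cells for the id (rows * 5) minus the NAs recorded in na_count
  mutate(evidence = ifelse(n() * 5 - sum(na_count) >= threshold, "yes", "no"))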
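The 5 comes from the number of columns you have, q1:q5. So n()*5 is the total number of cells for an id, and subtracting sum(na_count) leaves the number of non-NA values for that id.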
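If you would rather not hard-code the 5, or you want to double-check that na_count agrees with the data, a rough base R sketch of the same idea is below. It counts the non-NA cells directly from q1:q5 and sums them per id. The names q_cols, non_na_per_row and non_na_per_id are just ones I made up for illustration, and it assumes the question columns really are named q1 to q5 as in the dput() above.

```r
# Rebuild the example data from the dput() output in the question.
df <- structure(list(
  id = c(7L, 7L, 7L, 9L, 9L, 9L, 9L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  na_count = c(3L, 1L, 3L, 0L, 1L, 0L, 1L, 3L, 1L, 2L, 2L, 3L, 4L, 2L, 1L, 1L, 0L),
  task = c("a", "b", "c", "a", "b", "c", "d", "a", "b", "b", "c", "d", "a", "b", "c", "d", "e"),
  q1 = c(1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L, NA, 2L),
  q2 = c(NA, 0L, NA, 1L, 0L, 1L, 0L, NA, 1L, 1L, 1L, NA, NA, 2L, 1L, 1L, 2L),
  q3 = c(NA, 0L, 1L, 0L, 0L, 0L, NA, NA, NA, NA, NA, 1L, NA, NA, 2L, 3L, 3L),
  q4 = c(2L, NA, NA, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, NA, NA, 1L, NA, 3L, 3L),
  q5 = c(NA, 0L, 1L, 1L, NA, 0L, 1L, NA, 1L, NA, NA, 1L, NA, NA, 2L, 3L, 4L)),
  row.names = c(NA, -17L), class = "data.frame")

threshold <- 10
q_cols <- paste0("q", 1:5)  # assumed names of the question columns

# Non-NA cells in each row, counted directly from q1:q5.
non_na_per_row <- rowSums(!is.na(df[q_cols]))

# Per-id totals, ordered by id (1, 2, 3, 7, 9): 8, 17, 6, 8, 18
tapply(non_na_per_row, df$id, sum)

# Broadcast the per-id total back to every row and compare with the threshold.
non_na_per_id <- ave(non_na_per_row, df$id, FUN = sum)
df$evidence <- ifelse(non_na_per_id >= threshold, "yes", "no")
```

For this data it gives the same evidence column as the na_count version, since length(q_cols) - na_count equals the per-row non-NA count in every row. If na_count is trusted, the na_count version avoids rescanning the q columns.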