Search code examples
rdplyrmultiple-columnsrowsna

row wise NA count across some columns - grouped by id


I have a dataframe df as below:

Input

id  na_count    task q1   q2   q3   q4  q5
7   3           a    1    NA   NA   2   NA
7   1           b    1    0    0    NA  0
7   3           c    NA   NA   1    NA  1
9   0           a    1    1    0    2   1
9   1           b    1    0    0    1   NA
9   0           c    1    1    0    1   0
9   1           d    1    0    NA   1   1
3   3           a    1    NA   NA   1   NA
3   1           b    1    1    NA   2   1
1   2           b    1    1    NA   1   NA
1   2           c    1    1    NA   1   NA
1   3           d    NA   NA   1    NA  1
2   4           a    1    NA   NA   NA  NA
2   2           b    1    2    NA   1   NA
2   1           c    1    1    2    NA  2
2   1           d    NA   1    3    3   3
2   0           e    2    2    3    3   4
  1. I am interested in adding a binary column or flag evidence which is computed by looking at data per id and then finding whether that id meets a minimum threshold of non-NA values.

  2. As an example, I have my minimum non-NA threshold set to 10. So if for any id there are at least 10 non-NA values (over multiple rows) then I want to set evidence to Yes, otherwise I want to set evidence to No

  3. (Prefered) If possible, I want to use the count of non-NA values from the column na_count rather than actually computing NAs over the columns q1:q5

Output

For the example with the threshold of 10 non-NA, my output would be as below:

id  na_count    task q1   q2   q3   q4  q5  evidence
7   3           a    1    NA   NA   2   NA  no
7   1           b    1    0    0    NA  0   no
7   3           c    NA   NA   1    NA  1   no
9   0           a    1    1    0    2   1   yes
9   1           b    1    0    0    1   NA  yes
9   0           c    1    1    0    1   0   yes
9   1           d    1    0    NA   1   1   yes
3   3           a    1    NA   NA   1   NA  no
3   1           b    1    1    NA   2   1   no
1   2           b    1    1    NA   1   NA  no
1   2           c    1    1    NA   1   NA  no
1   3           d    NA   NA   1    NA  1   no
2   4           a    1    NA   NA   NA  NA  yes
2   2           b    1    2    NA   1   NA  yes
2   1           c    1    1    2    NA  2   yes
2   1           d    NA   1    3    3   3   yes
2   0           e    2    2    3    3   4   yes

Partial solution

I have tried the following, but it just counts the rows not the non-NA values over multiple rows for that id.

library(dplyr)
df = df %>%
       group_by(id) %>%
           mutate(rows = n())

Related Posts

The following posts are related but do not address my problem How to make n() do not count NAs too in tidyverse?, Taking a count() after group_by() for non-missing values and Count number of non-NA values by group

dput()

For coding, I am also copying the dput() of the dataframe

# dput(df)

structure(list(
id = c(7L, 7L, 7L, 9L, 9L, 9L, 9L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), 
na_count = c(3L, 1L, 3L, 0L, 1L, 0L, 1L, 3L, 1L, 2L, 2L, 3L, 4L, 2L, 1L, 1L, 0L), 
task = c("a", "b", "c", "a", "b", "c", "d", "a", "b", "b", "c", "d", "a", "b", "c", "d", "e"), 
q1 = c(1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L, NA, 2L), 
q2 = c(NA, 0L, NA, 1L, 0L, 1L, 0L, NA, 1L, 1L, 1L, NA, NA, 2L, 1L, 1L, 2L), 
q3 = c(NA, 0L, 1L, 0L, 0L, 0L, NA, NA, NA, NA, NA, 1L, NA, NA, 2L, 3L, 3L), 
q4 = c(2L, NA, NA, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, NA, NA, 1L, NA, 3L, 3L), 
q5 = c(NA, 0L, 1L, 1L, NA, 0L, 1L, NA, 1L, NA, NA, 1L, NA, NA, 2L, 3L, 4L)), 
row.names = c(NA, -17L), class = "data.frame")

Any help on this would be greatly appreciated, thanks!


Solution

  • library(tidyverse)
    
    threshold = 10
    
    df %>% group_by(id) %>%
      mutate(evidence = ifelse(n()*5 - sum(na_count) >= threshold, "yes", "no"))
    

    The 5 comes from the number of columns you have, q1:q5.