r boolean na missing-data boolean-operations

Checking if the sum of logical variables is greater than n, with NA, in R


I have a dataframe with 5 binary variables (TRUE or FALSE, but represented as 0 or 1 for convenience) which can have missing values:

df <- data.frame(a = c(1,0,1,0,0,...),
                 b = c(1,0,NA,0,1,...),
                 c = c(1,0,1,0,NA,...),
                 d = c(0,1,1,NA,NA,...),
                 e = c(0,0,0,1,1,...))
     a  b  c  d  e
 1   1  1  1  0  0
 2   0  0  0  1  0
 3   1 NA  1  1  0
 4   0  0  0 NA  1
 5   0  1 NA NA  1
...

Now I want to make a variable that indicates whether the observation satisfies more than two conditions out of the five, that is, whether the sum of a, b, c, d, and e is greater than 2.

For the first row and the second row, the values are obviously TRUE and FALSE respectively. For the third row, the value should be TRUE, since the sum is greater than 2 regardless of whether b is TRUE or FALSE. For the fourth row, the value should be FALSE, since the sum is less than or equal to 2 regardless of whether d is TRUE or FALSE. For the fifth row, the value should be NA, since the sum can range from 2 to 4 depending on c and d. So the desired vector is c(TRUE, FALSE, TRUE, FALSE, NA, ...).

Here is my attempt:

df %>%
  mutate(a0 = ifelse(is.na(a), 0, a),
         b0 = ifelse(is.na(b), 0, b),
         c0 = ifelse(is.na(c), 0, c),
         d0 = ifelse(is.na(d), 0, d),
         e0 = ifelse(is.na(e), 0, e),
         a1 = ifelse(is.na(a), 1, a),
         b1 = ifelse(is.na(b), 1, b),
         c1 = ifelse(is.na(c), 1, c),
         d1 = ifelse(is.na(d), 1, d),
         e1 = ifelse(is.na(e), 1, e)
         ) %>%
  mutate(summin = a0 + b0 + c0 + d0 + e0,
         summax = a1 + b1 + c1 + d1 + e1) %>%
  mutate(f = ifelse(summax <= 2,
                    FALSE,
                    ifelse(summin >= 3, TRUE, NA)))

This worked, but I had to create too many redundant variables, and the code would get even longer with more variables. Is there a better solution?


Solution

  • I just noticed that you want NA when the missing values would decide whether the result is TRUE or FALSE, so I have changed the answer.

    Nesting two conditional calls lets you first test whether the row already has a sum greater than 2 and, if not, check whether the row sum plus the number of missing values is 2 or less.

    library(tidyverse)
    n <- 2
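    # TRUE if the observed sum already exceeds n; FALSE if it cannot exceed n
    # even when every missing value counts as 1; NA otherwise.
    want <- if_else(rowSums(df, na.rm = TRUE) > n,
                    TRUE,
                    if_else(rowSums(df, na.rm = TRUE) + rowSums(is.na(df)) <= n,
                            FALSE,
                            NA))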
    

    If you want to stick to base R, you can use ifelse() instead of if_else().
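
For illustration, here is a minimal base-R sketch of the same idea. It assumes only the five rows shown in the question (the elided rows are not reproduced), and the names sum_min and sum_max are hypothetical helpers introduced here, not part of the original answer.

```r
# Only the five rows visible in the question; the elided rows are left out.
df <- data.frame(a = c(1, 0, 1, 0, 0),
                 b = c(1, 0, NA, 0, 1),
                 c = c(1, 0, 1, 0, NA),
                 d = c(0, 1, 1, NA, NA),
                 e = c(0, 0, 0, 1, 1))

n <- 2

# Smallest possible sum per row: every missing value counted as 0.
sum_min <- rowSums(df, na.rm = TRUE)
# Largest possible sum per row: every missing value counted as 1.
sum_max <- sum_min + rowSums(is.na(df))

# TRUE if even the minimum exceeds n, FALSE if even the maximum does not,
# NA when the missing values could tip the result either way.
want <- ifelse(sum_min > n, TRUE, ifelse(sum_max <= n, FALSE, NA))
want
```

For these five rows the result should be TRUE, FALSE, TRUE, FALSE, NA, matching the desired vector in the question. Because rowSums() works on the whole data frame, the same code should carry over to more columns without creating per-column helper variables.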