Tags: r, na, sparklyr, dplyr, rowsum

Create an indicator variable in sparklyr when all the variables are missing


I am trying to use rowSums in sparklyr to create an indicator variable that flags rows where all the variables are missing, but it seems that rowSums doesn't work in sparklyr.

Instead, I have to write the name of every variable inside is.na(), as below, which is impractical since I have 100 variables.

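library(sparklyr)
library(dplyr)

y <- c(NA, 1, 2)
x <- c(NA, NA, 3)
z <- c(NA, NA, NA)
dt <- data.frame(x, y, z)

# sc is an existing Spark connection; keep a reference to the Spark table
dt_tbl <- sdf_copy_to(sc, dt)

dt_tbl %>%
  mutate(new = ifelse(is.na(x) & is.na(y) & is.na(z), 1, 0))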

Is there any way to pass multiple variables to is.na() at once?


Solution

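Build the condition as a string, then parse it with rlang and unquote it inside mutate(). Here dt can be a local data frame or a Spark table; colnames() works on both.

    library(dplyr)
    library(rlang)
    library(glue)

    1. Create a string containing an is.na() call for every variable of interest. I am using all of the columns for simplicity; otherwise select the names with a regular expression (e.g., grep).

      cols_of_interest <- colnames(dt)

      # Builds "is.na(x) & is.na(y) & is.na(z)"
      na_checks <- glue_collapse(glue("is.na({cols_of_interest})"), sep = " & ")
      test_string <- glue("ifelse({na_checks}, yes = 1, no = 0)")

    2. Parse the string with rlang and unquote the resulting expression with !!

      dt %>% mutate(flag = !!rlang::parse_expr(test_string))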
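
The same condition can also be assembled directly as an expression, without going through a string. The following is only a sketch under a few assumptions: it runs on a local data frame with the toy data from the question so that it needs no Spark connection, and the names na_checks, all_missing and result are made up for illustration. Since it produces the same is.na(x) & is.na(y) & is.na(z) expression as the string version, it should presumably translate the same way on a Spark table, but that is not verified here.

```r
library(dplyr)
library(rlang)

# Toy data from the question, kept local for illustration (assumption:
# on Spark, dt would be the table returned by sdf_copy_to()).
dt <- data.frame(
  x = c(NA, NA, 3),
  y = c(NA, 1, 2),
  z = c(NA, NA, NA)
)

cols_of_interest <- colnames(dt)

# One unevaluated is.na(<column>) call per column of interest.
na_checks <- lapply(syms(cols_of_interest), function(col) call2("is.na", col))

# Fold the calls together with `&` into a single expression.
all_missing <- Reduce(function(a, b) call2("&", a, b), na_checks)

# Unquote the combined expression inside mutate().
result <- dt %>% mutate(flag = ifelse(!!all_missing, 1, 0))
```

Working with calls instead of strings avoids quoting problems when column names contain spaces or other non-syntactic characters, at the cost of being a little less readable than the glue version.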