Search code examples
rdplyrautomationconditional-statements

Automating conditional logic for database data checks in R


I am trying to do a large data check for a database. Some fields in the database are hidden, so when I am doing the datacheck, I need to ignore all hidden fields. Fields are hidden based on conditional logic stored in the database. I have exported this conditional logic and have stored it in a dataframe in R. Now I need to automate the data check by somehow using the text string of a conditional argument to automate the script writing itself, which I do not think is possible, or finding a way around this problem.

Below is example code that I need to solve:

id <- c(1001, 1002, 1003, 1004, 1005, 1001, 1002, 1003, 1004, 1005)
target_var <- c("race","race","race","race","race", "race_other", 
         "race_other", "race_other", "race_other", "race_other")
value <- c(1, NA, 1, 1, 6, NA, NA, NA, NA, "Asian")
branching_logic <- c(NA, NA, NA, NA, NA, 
                     "race == 6", "race == 6", "race == 6", 
                     "race == 6", "race == 6")
race <- c(NA, NA, NA,NA, NA, 1, 1, 1, 6, 6)

data <- data.frame(id, var, value, branching_logic, race) %>%
  mutate(data_check_result = case_when(
    !is.na(value) ~ "No Missing Data", 
    is.na(value) & is.na(branching_logic) ~ "Missing Data 1",
    is.na(value) & race == 6 ~ "Missing Data 2", 
    is.na(value) & race != 6 ~ "Hidden field",
  ))

It would be great if I could replace (race==6) with a variable or somehow directing the script to the conditional expression already saved as a string, but I know that R can't do that.

The above problem has four categories which the data could fall into:

  • No Missing Data: only if value is non-na
  • Missing Data 1: if the value is NA, and there is no branching logic that hid the variable.
  • Missing Data 2: if the value is NA and the branching logic is met to show the field
  • Hidden Field: if the value is NA and the branching logic is NOT net to show the field

I have thousands of fields to check with accompanying branching logic, so I need a way to use the branching logic saved in the "branching_logic" column within the script.

IMPORTANT NOTE: The case here is the simplest case. Many target_var variables and value variables have branching logic that looks at multiple other variables to determine whether to hide the field (Ex. race==6 & race==1)

This is only my second time posting, and I usually do not see such in depth problems here, but it would be great if someone has an idea!


Solution

  • You can store the expression you want to evaluate as a string if you pass it into parse() first as explained in this answer.

    Here's a simple example of how you can store the expression in a column and then feed it to dplyr::case_when().

    library(tidyverse)
    
    set.seed(1)
    d <- tibble(
      a = sample(10),
      b = sample(10),
      c = "a >  b"
    )
    
    d %>% 
      mutate(a_bigger = case_when(
        eval(parse(text = c)) ~ "Y",
        TRUE ~ "N"
      ))
    #> # A tibble: 10 x 4
    #>        a     b c      a_bigger
    #>    <int> <int> <chr>  <chr>   
    #>  1     9     3 a >  b Y       
    #>  2     4     1 a >  b Y       
    #>  3     7     5 a >  b Y       
    #>  4     1     8 a >  b N       
    #>  5     2     2 a >  b N       
    #>  6     5     6 a >  b N       
    #>  7     3    10 a >  b N       
    #>  8    10     9 a >  b Y       
    #>  9     6     4 a >  b Y       
    #> 10     8     7 a >  b Y
    

    Created on 2022-03-07 by the reprex package (v2.0.1)