Search code examples
rvalidationfilteringmutation

R Validating data only if column exists


I am trying to create a report on data validation in R; I have used the validate package to generate the general summary of the data, but I need to get the specifics of what is failing our validation checks.

What I want to end up with is a data frame of ids, columns that fail their test and the value that is failing the test. However, not all of the columns are mandatory, so I need to be able to check whether the data passes without knowing whether or not the column is going to be there.

For other data frames with mandatory data, I converted it to True/False whether it passes the tests. For example:

library(dplyr)
library(validate)
library(tidyr)

test_df = data.frame(id = 1:10, 
                 a = 11:20, 
                 b = c(21:25,36,27:30), 
                 c = c(41,52,43:50))

text_check = test_df %>% transmute(
      a = a>21,
      b = b > 31,
      c = c> 51
)

value_fails<-data.frame(id = test_df$id, text_check[,-1][colSums(text_check[,-1]) > 0])

value_failures_gath = gather(value_fails, column, changed, -id) %>% filter(changed == TRUE)
value_failures_gath$Value = apply(value_failures_gath, c(1), function(x) 
              test_df[test_df$id == x[['id']], grep(x[['column']], colnames(test_df))])
value_failures_gath<-value_failures_gath %>% arrange(id, column)
value_failures_gath$changed<-NULL

colnames(value_failures_gath)<-c('ID','Field','Value')

> value_failures_gath
  ID Field Value
1  2     c    52
2  6     b    36

I have a data frame with the checks I want to create, in the style of:

second_data_check = data.frame(a = 'a>21',
                           b = 'b > 31',
                           c = 'c> 51',
                           d = 'd> 61')

I can't just run these as are, since we don't have column D to check, but other data frames that are run through this validation might have column D but not have column B for example. I can filter this data frame to only include the tests for the columns we have but then is there a way to apply the tests in this data frame as checks? Is there a better way to do this?

Thanks so much for the help!


Solution

  • I would set up the checks one at a time so that you can check variable existence before evaluation. Would the following solution work?

    text_check = data.frame(id=test_df$id)
    
    if('a' %in% colnames(test_df)){
      text_check_temp = test_df %>% transmute(a=a>21)
      text_check <- cbind(text_check, text_check_temp)
    }
    if('b' %in% colnames(test_df)){
      text_check_temp = test_df %>% transmute(b=b>31)
      text_check <- cbind(text_check, text_check_temp)
    }
    if('c' %in% colnames(test_df)){
      text_check_temp = test_df %>% transmute(c=c>51)
      text_check <- cbind(text_check, text_check_temp)
    }
    if('d' %in% colnames(test_df)){
      text_check_temp = test_df %>% transmute(d=d>61)
      text_check <- cbind(text_check, text_check_temp)
    }
    

    I was trying to further refactor the code by looping through the transmute checks but was unable to figure out how to evaluate string formulas properly.

    Jason