I am trying to create a report on data validation in R; I have used the validate package to generate the general summary of the data, but I need to get the specifics of what is failing our validation checks.
What I want to end up with is a data frame of ids, columns that fail their test and the value that is failing the test. However, not all of the columns are mandatory, so I need to be able to check whether the data passes without knowing whether or not the column is going to be there.
For other data frames with mandatory data, I converted it to True/False whether it passes the tests. For example:
library(dplyr)
library(validate)
library(tidyr)
test_df = data.frame(id = 1:10,
a = 11:20,
b = c(21:25,36,27:30),
c = c(41,52,43:50))
text_check = test_df %>% transmute(
a = a>21,
b = b > 31,
c = c> 51
)
value_fails<-data.frame(id = test_df$id, text_check[,-1][colSums(text_check[,-1]) > 0])
value_failures_gath = gather(value_fails, column, changed, -id) %>% filter(changed == TRUE)
value_failures_gath$Value = apply(value_failures_gath, c(1), function(x)
test_df[test_df$id == x[['id']], grep(x[['column']], colnames(test_df))])
value_failures_gath<-value_failures_gath %>% arrange(id, column)
value_failures_gath$changed<-NULL
colnames(value_failures_gath)<-c('ID','Field','Value')
> value_failures_gath
ID Field Value
1 2 c 52
2 6 b 36
I have a data frame with the checks I want to create, in the style of:
second_data_check = data.frame(a = 'a>21',
b = 'b > 31',
c = 'c> 51',
d = 'd> 61')
I can't just run these as are, since we don't have column D to check, but other data frames that are run through this validation might have column D but not have column B for example. I can filter this data frame to only include the tests for the columns we have but then is there a way to apply the tests in this data frame as checks? Is there a better way to do this?
Thanks so much for the help!
I would set up the checks one at a time so that you can check variable existence before evaluation. Would the following solution work?
text_check = data.frame(id=test_df$id)
if('a' %in% colnames(test_df)){
text_check_temp = test_df %>% transmute(a=a>21)
text_check <- cbind(text_check, text_check_temp)
}
if('b' %in% colnames(test_df)){
text_check_temp = test_df %>% transmute(b=b>31)
text_check <- cbind(text_check, text_check_temp)
}
if('c' %in% colnames(test_df)){
text_check_temp = test_df %>% transmute(c=c>51)
text_check <- cbind(text_check, text_check_temp)
}
if('d' %in% colnames(test_df)){
text_check_temp = test_df %>% transmute(d=d>61)
text_check <- cbind(text_check, text_check_temp)
}
I was trying to further refactor the code by looping through the transmute checks but was unable to figure out how to evaluate string formulas properly.
Jason