Tags: r, performance, error-correction

How to make checking values for errors more efficient


I am dealing with around 700 large datasets (~80 variables and up to 200k rows each). Most variables are only allowed to fall within a certain range or to take certain values. Each dataset contains some errors: values that do not adhere to these rules.

So, somehow I need to find those errors and set them to NA. Currently I am doing this in the following way:

# Keep only the allowed codes for a categorical variable
dataset$variable[!dataset$variable %in% c(1,2,3)] <- NA
# or keep only the allowed range for a numeric variable
dataset$variable[!(dataset$variable >= 0 &
                   dataset$variable <= 150)] <- NA
... # repeated for each of the ~80 variables

My issue is that applying these rules takes some time (up to 10 seconds), and I would like to speed the process up. Do you have any ideas on how to make it more efficient? Or is this already the simplest way, and I just have to live with it?

Thank you so much in advance!


Edit: I am wondering, for example: would it be faster to iterate over the rows instead of the columns?


Solution

  • If you are looking for pure performance, I suggest you check out the data.table package, which enhances data.frames in R. data.table can do replacements in place, by reference (without copying). When you replace values with the assignment operator (<-) in a regular data.frame, R copies the affected data, which causes slowdowns for large datasets.

    A data.table solution for your problem:

    library(data.table)
    
    # Cast as a data.table
    dataset <- as.data.table(dataset)
    
    # Do the replacement by reference with the := operator in data.table
    # See ?`:=`
    dataset[!variable %in% c(1,2,3), variable := NA]
    # or, for a range rule, blank out values outside the allowed range
    dataset[!(variable >= 0 & variable <= 150), variable := NA]
    

    This should be much faster than replacing values in a regular data.frame with [<-.
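
    Since you have around 80 rule-bound variables per dataset, you can also avoid writing 80 separate replacement lines by keeping one validation function per column in a list and applying them by reference in a loop with data.table's set(). This is only a sketch of that idea: the column names (variable1, variable2) and rule functions are hypothetical placeholders for your actual variables.

    library(data.table)
    
    # One validation function per column; names must match the dataset's columns.
    # (variable1/variable2 are placeholders for your real variable names.)
    rules <- list(
      variable1 = function(x) x %in% c(1, 2, 3),   # allowed codes
      variable2 = function(x) x >= 0 & x <= 150    # allowed range
      # ... one entry per rule-bound variable
    )
    
    apply_rules <- function(dt, rules) {
      for (col in intersect(names(rules), names(dt))) {
        bad <- which(!rules[[col]](dt[[col]]))      # row indices violating the rule
        if (length(bad)) set(dt, i = bad, j = col, value = NA)
      }
      invisible(dt)
    }
    
    # Modifies 'dataset' in place; no copy of the 200k rows is made
    apply_rules(dataset, rules)

    Because set() updates columns by reference, the cost is roughly one vectorised check per rule, and the same rules list can be reused when looping over all 700 datasets.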