I am dealing with around 700 large datasets (with ~ 80 variables and up to 200k rows per dataset). Most variables are only allowed to be within a certain range or to be certain values. There are some errors in each dataset with values that do not adhere to these rules.
So, somehow I need to find those errors and set them to NA. Currently I am doing this in the following way:
library(data.table)
dataset$variable[!dataset$variable %in% c(1,2,3)] <- NA
# or
dataset$variable[!(dataset$variable >= 0 &
                   dataset$variable <= 150)] <- NA
... # for 80 variables
My issue is that applying these rules takes some time (up to 10 seconds), and I would like to speed up that process. Do you have any ideas on how to make it more efficient? Or is this already the simplest way, and I just have to deal with it?
Thank you so much in advance!
Edit: For example, I am wondering: would it be faster to iterate over the rows instead of the columns?
If you are looking for pure performance, I suggest you check out the data.table package, which enhances data.frames in R. data.table can do replacements in place, by reference (without copying). When you invoke the assignment operation (<-) in R, I believe that a copy of the data is made, which causes slowdowns for large data sets.
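If you want to see the copying on your own data, base R's tracemem() prints a message whenever R duplicates the traced object, and it stays silent when data.table updates a column by reference. A minimal sketch, assuming dataset is a data.frame with a numeric column variable (names are placeholders):
library(data.table)
tracemem(dataset)                               # start reporting copies of the data.frame
dataset$variable[dataset$variable > 150] <- NA  # prints a tracemem[...] line: a copy was made
untracemem(dataset)                             # stop tracing before converting

dt <- as.data.table(dataset)
tracemem(dt)
dt[variable > 150, variable := NA]              # prints nothing: the column is updated by reference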
A data.table solution for your problem:
library(data.table)
# Convert to a data.table
dataset <- as.data.table(dataset)
# Do the replacement by reference with the := operation in data.table
# See ?`:=`
dataset[!variable %in% c(1,2,3), variable := NA]
# Again, negate the condition so that only out-of-range values become NA
dataset[!(variable >= 0 & variable <= 150), variable := NA]
This should be much faster than replacing values in a regular data.frame with [<-.
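With ~80 variables you probably do not want to write one such line per column by hand. One way to do this (a sketch; the rules list and column names below are made up and would need to be adapted to your real variables) is to loop over the columns with data.table's set(), which also assigns by reference:
library(data.table)

# Hypothetical rule list: each entry is either a set of allowed values
# or an allowed numeric range
rules <- list(
  variable1 = list(allowed = c(1, 2, 3)),
  variable2 = list(min = 0, max = 150)
)

for (v in names(rules)) {
  r <- rules[[v]]
  x <- dataset[[v]]
  bad <- if (!is.null(r$allowed)) {
    !x %in% r$allowed           # value not in the allowed set
  } else {
    !(x >= r$min & x <= r$max)  # value outside the allowed range
  }
  # set() writes by reference like :=, but avoids the overhead of
  # [.data.table inside a loop; values that are already NA stay NA
  set(dataset, i = which(bad), j = v, value = NA)
}
As a side note on the edit in the question: column-wise, vectorised comparisons like the ones above are generally much faster in R than iterating over rows.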