Search code examples
rdata-cleaning

R - Cleaning dataset columns for NA,NaN,DIV/0


I have a dataset with 160 columns. Some of these columns contains a lot of NA and #DIV/0! I load the data in the following way:

training = read.csv("training.csv",header = TRUE,na.strings = c("NA","NaN","","#DIV/0!"))

How can I keep only columns that contains values in all rows?


Solution

  • @SRivero's answer works, here is another

    set.seed(1234)
    dat <- as.data.frame(matrix(runif(100000),1000,10))
    dat[sample(1:100,20,replace=TRUE),sample(1:10,3,replace=TRUE)] <- NA
    
    # apply through each column seeing if any are NAs
    dat[,sapply(dat,function(x) !any(is.na(x)))]
    
    # Check if both answers give same result
    all.equal(dat[,which(sapply(dat,function(x) !any(is.na(x))))],
    dat[ , colSums(is.na(dat)) == 0])
    [1] TRUE
    

    Though mine is a bit faster

    library(microbenchmark)
    
    microbenchmark(dat[,sapply(dat,function(x) !any(is.na(x)))],
    dat[ , colSums(is.na(dat)) == 0])
    Unit: microseconds
                                               expr     min       lq      mean   median      uq      max neval
     dat[, sapply(dat, function(x) !any(is.na(x)))]  87.464  89.7790  94.51491  90.9830  97.124  190.865   100
                    dat[, colSums(is.na(dat)) == 0] 197.958 199.9585 226.49657 201.4265 207.278 1382.612   100