r, dataframe, na, missing-data

How to remove all columns that contain more than 2000 NA values?


I found a similar example that removes columns by the proportion of NA values:

## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)
## Remove columns with more than 50% NA
dat[, -which(colMeans(is.na(dat)) > 0.5)]

But I am not sure how to change the condition so that it uses an absolute number of NA values rather than a percentage.


Solution

  • One base R option could be:

    dat[, colMeans(is.na(dat)) <= 0.5]
    
       X1 X2 X4 X5 X6 X8 X10
    1  NA 11 NA NA NA 71  NA
    2  NA 12 32 NA 52 72  NA
    3   3 NA 33 NA 53 73  93
    4   4 14 NA 44 NA NA  94
    5   5 15 35 NA 55 75  95
    6  NA NA 36 46 NA 76  NA
    7  NA NA NA 47 57 NA  97
    8   8 18 NA 48 NA 78  98
    9   9 NA 39 NA 59 79  99
    10 NA NA 40 50 NA 80 100
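
Note that this uses a logical index instead of the `-which(...)` form from the question. The sketch below, on a small made-up data frame `df` (not the data above), illustrates one reason that may matter: when no column exceeds the threshold, `which()` returns an empty vector, and negating it selects zero columns rather than all of them.

```r
# Made-up data for illustration: no column has more than 50% NA
df <- data.frame(x = 1:3, y = c(NA, 2, 3))

# which() finds nothing to drop, so drop_idx is integer(0)
drop_idx <- which(colMeans(is.na(df)) > 0.5)

# -integer(0) is still integer(0): this selects NO columns
n_cols_which <- ncol(df[, -drop_idx, drop = FALSE])

# The logical index keeps every column that passes the test
n_cols_logical <- ncol(df[, colMeans(is.na(df)) <= 0.5, drop = FALSE])

n_cols_which    # 0
n_cols_logical  # 2
```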
    

    Or using an absolute number of NAs (5 here; for the question, replace it with 2000):

    dat[, colSums(is.na(dat)) <= 5]
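
Applied to the threshold in the question, the same idea could look like the sketch below. The data frame `df` and the names `max_na` and `na_counts` are made up for illustration; `drop = FALSE` is added so the result stays a data frame even if only one column survives.

```r
# Made-up data: 3000 rows, columns with known NA counts
n <- 3000
df <- data.frame(
  a = c(rep(NA, 2500), seq_len(500)),   # 2500 NAs: above the threshold
  b = c(rep(NA, 2000), seq_len(1000)),  # exactly 2000 NAs: at the threshold
  c = seq_len(n)                        # no NAs
)

max_na <- 2000                   # threshold from the question
na_counts <- colSums(is.na(df))  # number of NAs per column

# Keep columns with at most max_na NAs, i.e. drop those with more than 2000
df_clean <- df[, na_counts <= max_na, drop = FALSE]

names(df_clean)  # "b" "c"
```

Because the condition is `<=`, a column with exactly 2000 NAs is kept, which matches "more than 2000" in the question.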
    

    Or using half of the rows as the criterion:

    dat[, colSums(is.na(dat)) <= nrow(dat)/2]
    

    And the same idea with dplyr:

    dat %>%
     select_if(~ mean(is.na(.)) <= 0.5)
    

    Or using an absolute number of NAs:

    dat %>%
     select_if(~ sum(is.na(.)) <= 5)
    

    Similarly, using half of the rows as the criterion:

    dat %>%
     select_if(~ sum(is.na(.)) <= length(.)/2)
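
In dplyr 1.0.0 and later, `select_if()` is superseded by `select()` combined with `where()`. A minimal sketch of the count-based version in that style, assuming a recent dplyr is installed and using a made-up data frame `df` and a hypothetical threshold `max_na`:

```r
library(dplyr)

# Made-up data for illustration
df <- data.frame(
  a = c(NA, NA, NA, 4),  # 3 NAs
  b = c(1, NA, 3, 4),    # 1 NA
  c = 1:4                # 0 NAs
)

max_na <- 2  # hypothetical threshold; the question would use 2000

# Keep only the columns whose NA count does not exceed max_na
result <- df %>%
  select(where(~ sum(is.na(.x)) <= max_na))

names(result)  # "b" "c"
```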