r, data-cleaning, na, missing-data

Retain observations whose NA count is <= 50% of total variables


Suppose we have this data frame with six observations and four variables:

df <- data.frame(a = c(1, NA, NA, 4, NA, 5),
                 b = c(NA, NA, NA, NA, NA, 1),
                 c = c(1, 2, 3, 4, NA, 6),
                 d = c(6, 7, NA, NA, 4, 4))
   a  b  c  d
1  1 NA  1  6
2 NA NA  2  7
3 NA NA  3 NA
4  4 NA  4 NA
5 NA NA NA  4
6  5  1  6  4

How can we retain only the observations whose number of NAs does not exceed 50% of the variables? (In this case each retained observation will have at most two NAs, so only 4 observations will be kept.)


Solution

  • Use rowSums(is.na(df)) to count the NAs in each row, then discard the rows that have more than threshold*ncol(df) NAs.

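    threshold <- 0.5

    # Keep rows whose NA count is at most threshold * ncol(df). A logical
    # mask avoids -which(), which drops every row when nothing matches.
    df <- df[rowSums(is.na(df)) <= threshold * ncol(df), ]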
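As a rough sketch of the same idea, the filter can be wrapped in a small function that works on the share of NAs per row via rowMeans(). The helper name keep_rows_by_na_share and the second data frame no_na are made up for illustration and are not part of the original answer. Filtering with a logical mask also behaves sensibly when no row exceeds the threshold, whereas negative indexing with -which(...) would return zero rows in that case.

```r
# Hypothetical helper (illustrative name): keep the rows whose share of
# NA values is at most `threshold`.
keep_rows_by_na_share <- function(data, threshold = 0.5) {
  na_share <- rowMeans(is.na(data))            # fraction of NAs in each row
  data[na_share <= threshold, , drop = FALSE]  # logical mask, no which()
}

# The data frame from the question.
df <- data.frame(a = c(1, NA, NA, 4, NA, 5),
                 b = c(NA, NA, NA, NA, NA, 1),
                 c = c(1, 2, 3, 4, NA, 6),
                 d = c(6, 7, NA, NA, 4, 4))

na_counts <- rowSums(is.na(df))  # NAs per row: 1 2 3 2 3 0
kept <- keep_rows_by_na_share(df, threshold = 0.5)
print(kept)                      # rows 1, 2, 4 and 6 remain

# Edge case: a data frame without any NAs keeps all of its rows.
no_na <- data.frame(x = 1:3, y = 4:6)
print(nrow(keep_rows_by_na_share(no_na)))  # 3
```

With threshold = 0.5 and four variables, a row may contain at most two NAs, which matches the four retained observations described in the question.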