Suppose we have this data frame with six observations and four variables:
df <- data.frame(a = c(1, NA, NA, 4, NA, 5),
                 b = c(NA, NA, NA, NA, NA, 1),
                 c = c(1, 2, 3, 4, NA, 6),
                 d = c(6, 7, NA, NA, 4, 4))
a | b | c | d |
---|---|---|---|
1 | NA | 1 | 6 |
NA | NA | 2 | 7 |
NA | NA | 3 | NA |
4 | NA | 4 | NA |
NA | NA | NA | 4 |
5 | 1 | 6 | 4 |
How can we retain only the observations in which NAs make up no more than 50% of the variables? (Here each retained observation has at most two NAs, so only four observations are kept.)
Use rowSums() to count the NAs in each row, then discard the rows that have more than threshold*ncol(df) NAs.
threshold <- 0.5
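# Keep rows whose NA count is at most threshold * ncol(df). A logical index
# is used because -which(...) drops every row when no row exceeds the limit.
df <- df[rowSums(is.na(df)) <= threshold * ncol(df), ]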
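The same filter can also be written in terms of the share of NAs per row, since rowMeans(is.na(df)) gives that fraction directly and can be compared with the threshold without multiplying by ncol(df). The following is only a sketch: the helper name drop_sparse_rows and its default threshold are made up for illustration and are not part of base R.

```r
# Hypothetical helper (name chosen for illustration): keep the rows whose
# share of NA cells is at most `threshold`.
drop_sparse_rows <- function(df, threshold = 0.5) {
  na_share <- rowMeans(is.na(df))            # fraction of NA cells per row
  df[na_share <= threshold, , drop = FALSE]  # logical index; stays a data frame
}

df <- data.frame(a = c(1, NA, NA, 4, NA, 5),
                 b = c(NA, NA, NA, NA, NA, 1),
                 c = c(1, 2, 3, 4, NA, 6),
                 d = c(6, 7, NA, NA, 4, 4))

kept <- drop_sparse_rows(df)
nrow(kept)  # 4 (rows 1, 2, 4 and 6)
```

Because the rows are selected with a logical vector, a data frame in which no row exceeds the threshold comes back unchanged.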