I have a huge data.frame with several NA values in it. It seems that I run into problems if many NA values occur consecutively.
Is there an easy way to find those rows in which NA values occur, e.g., 20 times in a row, but not the ones where 20 NA values occur in isolation?
EDIT (added by agstudy)
The accepted solution uses apply, which is not very efficient for a huge matrix. So I have edited the question (and added the Rcpp tag) to ask for a more efficient solution.
You can create a function analogous to complete.cases that detects consecutive missing values using rle:
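# TRUE for each row of dat that contains a run of more than n consecutive NAs
cons.missings <- function(dat, n)
  apply(is.na(dat), 1, function(x) {
    yy <- rle(x)                      # run lengths of NA / non-NA stretches
    any(yy$lengths[yy$values] > n)    # is any NA run longer than n?
  })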
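To see why this works, here is a small sketch of what rle returns for one row of is.na(dat). The vector x is made up for illustration and corresponds to a row with 4 consecutive NAs:

```r
# is.na() result for a hypothetical row: value, 4 NAs, value
x <- c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE)

yy <- rle(x)
yy$lengths               # 1 4 1              (length of each run)
yy$values                # FALSE TRUE FALSE   (what each run consists of)

# Keep only the lengths of the runs that are NA runs
na.runs <- yy$lengths[yy$values]
na.runs                  # 4
any(na.runs > 3)         # TRUE
```

Indexing yy$lengths with the logical yy$values keeps only the NA runs, so isolated NAs show up as several short runs rather than one long one.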
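Then, to keep only the good rows (those with no run of more than 20 consecutive NAs; pass 19 instead if runs of exactly 20 should be dropped as well):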
dat[!cons.missings(dat,20),]
Example with 4 consecutive missing values:
dat <- as.matrix(t(data.frame(a= c(1,rep(NA,4),5),
b= c(2,rep(NA,2),1,rep(NA,2)))))
  [,1] [,2] [,3] [,4] [,5] [,6]
a    1   NA   NA   NA   NA    5
b    2   NA   NA    1   NA   NA
dat[!cons.missings(dat,3),]
[1] 2 NA NA 1 NA NA
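Regarding the efficiency concern raised in the edit to the question: one possible alternative, sketched here in plain R rather than Rcpp, is to loop over the columns instead of the rows and keep a running count of consecutive NAs for all rows at once. The function name cons.missings.vec is made up for this sketch, and I have not benchmarked it; it should only pay off when there are many more rows than columns.

```r
# Sketch: same result as cons.missings, but vectorised over rows.
# (cons.missings.vec is a hypothetical name, not from any package.)
cons.missings.vec <- function(dat, n) {
  na   <- is.na(dat)            # logical matrix, works for matrix or data.frame
  run  <- integer(nrow(na))     # current NA run length in each row
  best <- integer(nrow(na))     # longest NA run seen so far in each row
  for (j in seq_len(ncol(na))) {
    run  <- ifelse(na[, j], run + 1L, 0L)   # extend the run or reset it
    best <- pmax(best, run)
  }
  best > n                      # TRUE where some run is longer than n
}

# Same example as above: row a has 4 consecutive NAs, row b at most 2
dat <- rbind(a = c(1, rep(NA, 4), 5),
             b = c(2, rep(NA, 2), 1, rep(NA, 2)))
res <- cons.missings.vec(dat, 3)   # TRUE for row a, FALSE for row b
```

It can be used in the same way as before, e.g. dat[!cons.missings.vec(dat, 20), ].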