r dataframe rcpp sequences na

Find sequences of NA values in data.frame rows


I have a huge data.frame with several NA values in it. I seem to run into problems when many NA values occur consecutively.

Is there an easy way to find the rows in which NA values occur, e.g., 20 times in a row, but not the rows in which 20 NA values occur in isolation?

EDIT (added by agstudy)

The accepted solution uses apply, which is not very efficient for a huge matrix. So I edited the question (and added the Rcpp tag) to ask for a more efficient solution.


Solution

  • You can create a function analogous to complete.cases that detects consecutive missing values using rle:

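    # Flag rows of dat that contain a run of more than n consecutive NAs.
    cons.missings <- function(dat, n) {
      apply(is.na(dat), 1, function(x) {
        yy <- rle(x)                      # runs of NA (TRUE) and non-NA (FALSE)
        any(yy$lengths[yy$values] > n)    # TRUE if any NA run is longer than n
      })
    }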
    

    Then, to keep only the rows without a run of more than 20 consecutive NAs:

    dat[!cons.missings(dat,20),]
    

    Example with 4 consecutive missing values:

    dat <- as.matrix(t(data.frame(a= c(1,rep(NA,4),5),
               b= c(2,rep(NA,2),1,rep(NA,2)))))
    
      [,1] [,2] [,3] [,4] [,5] [,6]
    a    1   NA   NA   NA   NA    5
    b    2   NA   NA    1   NA   NA
    
    dat[!cons.missings(dat,3),]
    [1]  2 NA NA  1 NA NA
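
Regarding the efficiency concern raised in the EDIT, one possible alternative in plain R is to walk over the columns once and keep a running count of consecutive NAs for every row at the same time. This is only a sketch: the name cons.missings.fast is made up for illustration, it assumes dat is a matrix or data.frame as above, and it has not been benchmarked here. It should help mostly when there are far more rows than columns.

```r
# Hypothetical helper (not from the original answer): same rule as
# cons.missings(), but iterating over columns instead of rows.
cons.missings.fast <- function(dat, n) {
  na <- is.na(dat)                    # logical matrix, TRUE where missing
  run <- integer(nrow(na))            # current NA run length per row
  longest <- integer(nrow(na))        # longest NA run seen so far per row
  for (j in seq_len(ncol(na))) {
    run <- (run + 1L) * na[, j]       # extend the run on NA, reset to 0 otherwise
    longest <- pmax(longest, run)
  }
  longest > n                         # same "more than n" rule as cons.missings
}

# Same example data as above, built with rbind for brevity
dat <- rbind(a = c(1, rep(NA, 4), 5),
             b = c(2, rep(NA, 2), 1, rep(NA, 2)))

flagged <- cons.missings.fast(dat, 3)
dat[!flagged, , drop = FALSE]         # drop = FALSE keeps the matrix shape
```

Passing drop = FALSE is optional. Without it, a single remaining row is returned as a plain vector, as in the output shown above.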