Search code examples
rnarun-length-encoding

Extend runs of certain length


I have a 640 x 2500 dataframe with numeric values and several NA values. My goal is to find a minimum of 75 consecutive NA values in each row. For each such run, I want to replace the previous and following 50 cells with NA values too.

Here's a scaled down example of one row:

x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
#        run of four NA:  ^   ^   ^   ^     

I want to detect the run of four consecutive NA, and then replace three values before and three values after the run with NA:

c(1, 3, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, 4, 3) 
#           ^   ^   ^                   ^   ^   ^

I have tried to first identify the consecutive NAs with rle, but running rle(is.na(df)) gives the error 'x' must be a vector of an atomic type. This occurs even when I select a single row.

Unfortunately, I do not know what the next steps to take would be in converting the previous and following 50 cells to NA.

Would highly appreciate any help on this, thanks in advance.


Solution

  • Because you comment that in your data "some [rows] begin and end with several NAs", hopefully this better represents the real data:

      A  B  C  D  E  F  G  H  I  J
    1 1  2  3 NA NA  6  7  8 NA 10
    2 1 NA NA NA  5  6  7 NA NA NA
    3 1  2  3  4 NA NA NA  8  9 10
    

    Let's assume that the minimum run length of NA to be expanded with NA is 2, and that two values before and two values after the run should be replaced with NA. In this example, row 2 would represent the case you mentioned in comment.

    First some data wrangling. I prefer to work with a data.table in long format. With data.table we have access to the useful constants .I and .N, and can easily create run IDs with rleid.

    # convert data.frame to data.table
    library(data.table)
    setDT(d)
    
    # set minimum length of runs to be expanded
    len = 2L
    
    # set number of values to replace on each side of run
    n = 2L
    
    # number of columns of original data (for truncation of indices)
    nc = ncol(d)
    
    # create a row index to keep track of the original rows in the long format
    d[ , ri := 1:.N]
    
    # melt from wide to long format
    d2 = melt(d, id.vars = "ri")
    
    # order by row index
    setorder(d2, ri)
    

    Now the actual calculations on the runs and their indices:

    d2[
      # check if the run is an "NA run" and has sufficient length 
      d2[ , if(anyNA(value) & .N >= len){
    
        # get indices before and after run, where values should be changed to NA  
        ix = c(.I[1] - n:1L, .I[.N] + 1L:n)
    
        # truncate indices to keep them within (original) rows 
        ix[ix >= 1 + (ri - 1) * nc & ix <= nc * ri]},
    
        # perform the calculation by row index and run index
        # grab replacement indices
        by = .(ri, rleid(is.na(value)))]$V1,
    
      # at replacement indices, set value to NA 
      value := NA]
    

    If desired, cast back to wide format

    dcast(d2, ri ~ variable, value.vars = "value")
    #    ri  A  B  C  D  E  F  G  H  I  J
    # 1:  1  1 NA NA NA NA NA NA  8 NA 10
    # 2:  2 NA NA NA NA NA NA NA NA NA NA
    # 3:  3  1  2 NA NA NA NA NA NA NA 10