Search code examples
rgroup-bydplyrrle

Replace consecutive repeat values based on different run lengths in R


Consider the following dataset:

dat<-data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3),
                var1 = c("A","NA","B","A","NA","NA","B","A","NA","NA","NA","C","A","NA","B","A","NA","NA","D","A","NA","NA","B"))

dat

First, I need to fill in all NA's with the values on either side of the NA's, which was successful in dplyr with:

mutate(value = ifelse(is.na(value), paste0(na.locf(value), "-", na.locf(value, fromLast=TRUE)), 
                        value))

This resulted in:

   id var1
1   1    A
2   1  A-B
3   1    B
4   1    A
5   1  A-B
6   1  A-B
7   1    B
8   2    A
9   2  A-C
10  2  A-C
11  2  A-C
12  2    C
13  2    A
14  2  A-B
15  2    B
16  3    A
17  3  A-D
18  3  A-D
19  3    D
20  3    A
21  3  A-B
22  3  A-B
23  3    B

However, I now need to keep some values while return others back to NA, based on the consecutive run lengths of the repeats (grouped by id col). If the consecutive repeat of A-B is longer than 1 then return all values to NA, and if the consecutive repeat of A-C is longer than 2 then return all values to NA, and if the consecutive repeat of A-D is longer than 3 then return all values to NA.

My desired outcome is:

   id var1
1   1    A
2   1  A-B
3   1    B
4   1    A
5   1   NA
6   1   NA
7   1    B
8   2    A
9   2   NA
10  2   NA
11  2   NA
12  2    C
13  2    A
14  2  A-B
15  2    B
16  3    A
17  3  A-D
18  3  A-D
19  3    D
20  3    A
21  3   NA
22  3   NA
23  3    B

I think this could be accomplished by some combo of group_by(id), then rle() or data.table's rleid(), then conditionally turn the values back to NA based on the value and the run length with case_when (I considered ifelse() but I have many more conditions than provided in the example and have read that case_when would be a better choice), but I have not been able to figure out how to write the precise code to do this. A similar question I've come across is Replace NA with previous value with limit however, it is a much simpler version of what I need to do.

Any suggestions would be much appreciated. I feel like I am close but I need assistance to get me to the desired result.


Solution

  • What you could do:

    myfun <- function(x){
      y <- rle(x)
      z <- match(y$values, LETTERS)
      ind <- which(is.na(z))
      m <- z[ind + 1] - z[ind - 1] >= y$lengths[ind]
      y$values[ind[m]] <- paste(y$values[ind[m] - 1], y$values[ind[m] + 1], sep = "-")
      inverse.rle(y)
    }
    
    
    transform(dat, var1 = ave(var1, id, FUN = myfun))
    
       id var1
    1   1    A
    2   1  A-B
    3   1    B
    4   1    A
    5   1   NA
    6   1   NA
    7   1    B
    8   2    A
    9   2   NA
    10  2   NA
    11  2   NA
    12  2    C
    13  2    A
    14  2  A-B
    15  2    B
    16  3    A
    17  3  A-D
    18  3  A-D
    19  3    D
    20  3    A
    21  3   NA
    22  3   NA
    23  3    B