Identify duplicate values and remove them


I have a vector:

vec <- c(2,3,5,5,5,5,6,1,9,4,4,4)

I want to check whether a value is repeated consecutively and, if so, keep the first two occurrences and assign NA to the rest.

For example, in the above vector, 5 is repeated four times, so I want to keep the first two 5's and set the last two to NA. Similarly, 4 is repeated three times, so I want to keep the first two 4's and set the third one to NA.

In the end my vector should look like:

2,3,5,5,NA,NA,6,1,9,4,4,NA

I did this:

bad.values <- vec - binhf::shift(vec, 1, dir="right") 
bad.repeat <- bad.values == 0

vec[bad.repeat] <- NA

[1]  2  3  5 NA NA NA  6  1  9  4 NA NA

I can only get it to keep the first 5 and the first 4, rather than the first two 5's and the first two 4's.

Any solutions?


Solution

  • Another option with just base R functions:

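    # Run-length encode vec: rl$lengths holds the length of each consecutive run
    rl <- rle(vec)
    
    # For each run, flag every element after the second as TRUE (to be set to NA)
    i <- unlist(lapply(rl$lengths, function(l) if (l > 2) c(FALSE,FALSE,rep(TRUE, l - 2)) else rep(FALSE, l)))
    
    # NA^FALSE is 1 and NA^TRUE is NA, so the flagged positions become NA
    vec * NA^i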
    

    which gives:

      [1]  2  3  5  5 NA NA  6  1  9  4  4 NA
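
  • A more compact variant of the same run-length idea, offered as a sketch rather than a definitive answer: base R's sequence() turns the run lengths into a within-run position counter, so any position above 2 marks an excess repeat. The function name cap_runs and its keep argument are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: keep at most `keep` consecutive repeats, set the rest to NA.
cap_runs <- function(x, keep = 2) {
  # rle(x)$lengths gives the length of each run of equal consecutive values;
  # sequence() expands those lengths into 1, 2, ..., n within each run.
  pos <- sequence(rle(x)$lengths)
  # Any element beyond the first `keep` positions of its run becomes NA.
  x[pos > keep] <- NA
  x
}

vec <- c(2,3,5,5,5,5,6,1,9,4,4,4)
cap_runs(vec)
# [1]  2  3  5  5 NA NA  6  1  9  4  4 NA
```

    Unlike vec * NA^i, this indexes and assigns instead of multiplying, so it should also work on character vectors, and keep can be changed to retain a different number of repeats.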