Add index to runs of equal values, accounting for NA


This is an example of my data:

df <- data.frame(dyad = c("a", "a", "b", NA, "c", NA, "c", "b"))
df
#   dyad
# 1    a
# 2    a
# 3    b
# 4 <NA>
# 5    c
# 6 <NA>
# 7    c
# 8    b

I want to create an index for consecutive runs of identical dyad values.

Note 1: a dyad might be repeated throughout the data frame, but it should always get a new unique label if it is not consecutive with the previous rows holding the same dyad. E.g. the "b" on rows 3 and 8 should have different ids.

Note 2: identical dyad values before and after an NA should have different ids. E.g. the "c" before and after the last NA should have different ids.

Thus, the expected result is:

#   dyad event
# 1    a     1
# 2    a     1
# 3    b     2
# 4 <NA>    NA
# 5    c     3
# 6 <NA>    NA
# 7    c     4
# 8    b     5

Any insight on how to make this work, or any other advice, is welcome!


Solution

  • Using rleid from data.table and cumsum.

    library(data.table)
    
    df$event <- rleid(df$dyad) - cumsum(is.na(df$dyad))
    df$event[is.na(df$dyad)] <- NA
    df
    
    #  dyad event
    #1    a     1
    #2    a     1
    #3    b     2
    #4 <NA>    NA
    #5    c     3
    #6 <NA>    NA
    #7    c     4
    #8    b     5
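
To see why the subtraction works, it may help to look at the two pieces separately. The sketch below rebuilds the same example as a plain vector; the names run_id and na_so_far are only illustrative and do not come from the answer. It assumes data.table is installed.

```r
library(data.table)

dyad <- c("a", "a", "b", NA, "c", NA, "c", "b")

# rleid() numbers every run, and each NA here is a run of its own
run_id <- rleid(dyad)
run_id
# [1] 1 2 ... : 1 1 2 3 4 5 6 7

# Number of NA rows seen up to and including each position
na_so_far <- cumsum(is.na(dyad))
na_so_far
# 0 0 0 1 1 2 2 2

# Subtracting removes the ids that were "used up" by the NA rows
event <- run_id - na_so_far
event[is.na(dyad)] <- NA
event
# 1 1 2 NA 3 NA 4 5
```

Each isolated NA consumes exactly one run id, so subtracting the running NA count closes the gaps in the numbering.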
    

    The solution above does not work when there are consecutive NAs: rleid treats them as a single run, while cumsum(is.na(...)) counts each NA separately. In that case we can use:

    x <- c("a", NA, NA, "a", "b", "b", "c", NA)
    y <- cumsum(!duplicated(rleid(x)) & !is.na(x))
    y[is.na(x)] <- NA
    y
    #[1]  1 NA NA  2  3  3  4 NA
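
If you would rather avoid the data.table dependency, the same idea can be sketched in base R. This is not part of the answer above; run_index is a hypothetical helper name. It marks the first element of every non-NA run and takes a cumulative sum of those marks.

```r
# Hypothetical helper (an illustration, not from the original answer):
# index runs of equal, non-NA values using base R only.
run_index <- function(x) {
  n <- length(x)
  if (n == 0L) return(integer(0))
  prev <- x[-n]   # elements 1 .. n-1
  curr <- x[-1L]  # elements 2 .. n
  # TRUE where the previous element is NA or differs from the current one
  differs <- is.na(prev) | (curr != prev)
  # A run starts at a non-NA element that is first or "differs"
  starts <- !is.na(x) & c(TRUE, differs)
  out <- cumsum(starts)
  out[is.na(x)] <- NA
  out
}

run_index(c("a", NA, NA, "a", "b", "b", "c", NA))
# 1 NA NA 2 3 3 4 NA

df <- data.frame(dyad = c("a", "a", "b", NA, "c", NA, "c", "b"))
df$event <- run_index(df$dyad)
df$event
# 1 1 2 NA 3 NA 4 5
```

Because NA positions are never counted as run starts, consecutive NAs need no special handling here.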