Assigning a unique identifier to consecutive sequences of binary values in R


I have a data frame with a column consisting of sequences of 0s and 1s. The 0s are not of interest, but each run of 1s marks an event in a time series, and the goal is to assign a unique value to each event. Simple integer values suffice. In the code below, 'x' is what I have and 'goal' is what I am after.

This seems so simple yet I don't quite know how to phrase the question on a help search...

What I have as a dataframe:

x <- c(rep(0,4),rep(1,5),rep(0,2),rep(1,4),rep(0,10),rep(1,3))

x <- data.frame(x)

What I want in the dataframe:

x$goal <- c(rep(0,4),rep(1,5),rep(0,2),rep(2,4),rep(0,10),rep(3,3))

Solution

  • This is effectively a run-length encoding, with a slight twist: the runs of 0s are set to 0 instead of getting their own id.

    data.table::rleid does this well. If you are not already using that package, a base-R equivalent is

    my_rleid <- function(x) { yy <- rle(x); rep(seq_along(yy$lengths), yy$lengths); }
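
    To see why that one-liner numbers the runs, here is a minimal walk-through of its two steps on a short toy vector. The vector v and the names runs and run_id are made up for illustration; only rle(), seq_along() and rep() come from base R.

    ```r
    # Toy input (hypothetical, shorter than the data in the question)
    v <- c(0, 0, 1, 1, 1, 0, 1)

    # rle() collapses the vector into runs: the length of each run and its value
    runs <- rle(v)
    runs$lengths  # 2 3 1 1
    runs$values   # 0 1 0 1

    # Number the runs 1..n, then repeat each number for the length of its run
    run_id <- rep(seq_along(runs$lengths), runs$lengths)
    run_id        # 1 1 2 2 2 3 4
    ```

    Every run gets an id, including the runs of 0s, which is why the 0 rows still have to be reset afterwards.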
    

    Applying it and then resetting the rows where x is 0 gives

    x$out <- my_rleid(x$x)
    x$out <- ifelse(x$x == 0, 0L, x$out)
    x
    #    x goal out
    # 1  0    0   0
    # 2  0    0   0
    # 3  0    0   0
    # 4  0    0   0
    # 5  1    1   2
    # 6  1    1   2
    # 7  1    1   2
    # 8  1    1   2
    # 9  1    1   2
    # 10 0    0   0
    # 11 0    0   0
    # 12 1    2   4
    # 13 1    2   4
    # 14 1    2   4
    # 15 1    2   4
    # 16 0    0   0
    # 17 0    0   0
    # 18 0    0   0
    # 19 0    0   0
    # 20 0    0   0
    # 21 0    0   0
    # 22 0    0   0
    # 23 0    0   0
    # 24 0    0   0
    # 25 0    0   0
    # 26 1    3   6
    # 27 1    3   6
    # 28 1    3   6
    

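    which is pretty close: each event has a unique id, but the ids skip values (2, 4, 6) because the runs of 0s used up the ids in between. If you need consecutive numbers with no gaps, then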

    x$out <- match(x$out, sort(unique(x$out))) - (0 %in% x$out)
    x
    #    x goal out
    # 1  0    0   0
    # 2  0    0   0
    # 3  0    0   0
    # 4  0    0   0
    # 5  1    1   1
    # 6  1    1   1
    # 7  1    1   1
    # 8  1    1   1
    # 9  1    1   1
    # 10 0    0   0
    # 11 0    0   0
    # 12 1    2   2
    # 13 1    2   2
    # 14 1    2   2
    # 15 1    2   2
    # 16 0    0   0
    # 17 0    0   0
    # 18 0    0   0
    # 19 0    0   0
    # 20 0    0   0
    # 21 0    0   0
    # 22 0    0   0
    # 23 0    0   0
    # 24 0    0   0
    # 25 0    0   0
    # 26 1    3   3
    # 27 1    3   3
    # 28 1    3   3
    

    I used - (0 %in% x$out) instead of a hard-coded 1 to guard against the possibility of there being no 0s in the data. (0 %in% x$out) resolves to FALSE or TRUE, which is coerced to 0L or 1L, respectively, when subtracted from integers. The adjustment is needed because, if there is a 0 in $out, the lookup table from sort(unique(.)) starts with 0 (here it is c(0, 2, 4, 6)), so match returns 1 for the zeros and 2, 3, 4 for the events. We want the x == 0 rows to be 0L, so we have to subtract one. If there are no zeroes in x$x, the table starts at 1 instead, match already returns the right ids, and nothing should be subtracted.

    If you are certain that your data always contains at least one 0, and you don't like the - (.) I appended to match(.), then you can change that to match(.) - 1L.
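
    As a quick check of that reasoning, the steps above can be bundled into one function and run on a few edge cases. This is only a sketch: the name label_events and the test vectors are hypothetical, and it assumes x contains only 0s and 1s with no NA values.

    ```r
    # Hypothetical helper that bundles the steps from the answer
    label_events <- function(x) {
      runs <- rle(x)
      id <- rep(seq_along(runs$lengths), runs$lengths)  # same as my_rleid()
      id <- ifelse(x == 0, 0L, id)                      # reset the non-events to 0
      match(id, sort(unique(id))) - (0 %in% id)         # renumber without gaps
    }

    label_events(c(0, 1, 1, 0, 1))  # 0 1 1 0 2  (starts with 0)
    label_events(c(1, 1, 0, 1))     # 1 1 0 2    (starts with an event)
    label_events(c(1, 1, 1))        # 1 1 1      (no 0s: nothing is subtracted)
    label_events(c(0, 0))           # 0 0        (no events at all)
    ```

    With a hard-coded - 1L, the third case would come back as 0 0 0, which is the situation the (0 %in% .) term guards against.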