Assigning a unique identifier to consecutive sequences of binary values in R

I have a dataframe with a column consisting of sequences of 0s and 1s. The 0s are not of interest, but each run of consecutive 1s signifies an event occurring in a time series, and the goal is to assign a unique value to each event. Simple integer values suffice. So in the code below, 'x' is what I have and 'goal' is what I am after.

This seems so simple yet I don't quite know how to phrase the question on a help search...

What I have as a dataframe:

x <- c(rep(0,4),rep(1,5),rep(0,2),rep(1,4),rep(0,10),rep(1,3))

x <- data.frame(x)

What I want in the dataframe:

x$goal <- c(rep(0,4),rep(1,5),rep(0,2),rep(2,4),rep(0,10),rep(3,3))

Solution

This is effectively a run-length encoding, with a slight twist (setting the runs of 0s back to 0).

data.table::rleid does this well. If you are not already using that package, a base-R equivalent is

my_rleid <- function(x) {
  yy <- rle(x)
  rep(seq_along(yy$lengths), yy$lengths)
}
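To see why this works, it may help to look at what rle() returns on a small vector. The vector v below is made up for illustration and is not the question's data.

```r
# Small illustrative vector (an assumption, not the question's data)
v <- c(0, 0, 1, 1, 1, 0, 1)

r <- rle(v)
r$lengths  # 2 3 1 1  (length of each run)
r$values   # 0 1 0 1  (value repeated in each run)

# Number the runs 1..n, then repeat each number by its run length
ids <- rep(seq_along(r$lengths), r$lengths)
ids        # 1 1 2 2 2 3 4
```

Every run gets an id, including the runs of 0s, which is why the 0s have to be reset afterwards.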

From here, we'll see

x$out <- my_rleid(x$x)
x$out <- ifelse(x$x == 0, 0L, x$out)
x
#    x goal out
# 1  0    0   0
# 2  0    0   0
# 3  0    0   0
# 4  0    0   0
# 5  1    1   2
# 6  1    1   2
# 7  1    1   2
# 8  1    1   2
# 9  1    1   2
# 10 0    0   0
# 11 0    0   0
# 12 1    2   4
# 13 1    2   4
# 14 1    2   4
# 15 1    2   4
# 16 0    0   0
# 17 0    0   0
# 18 0    0   0
# 19 0    0   0
# 20 0    0   0
# 21 0    0   0
# 22 0    0   0
# 23 0    0   0
# 24 0    0   0
# 25 0    0   0
# 26 1    3   6
# 27 1    3   6
# 28 1    3   6

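which is pretty close: each run of 1s has its own identifier, but the numbering skips values because the runs of 0s were counted too. If you need consecutive numbers (no gaps like above), then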

x$out <- match(x$out, sort(unique(x$out))) - (0 %in% x$out)
x
#    x goal out
# 1  0    0   0
# 2  0    0   0
# 3  0    0   0
# 4  0    0   0
# 5  1    1   1
# 6  1    1   1
# 7  1    1   1
# 8  1    1   1
# 9  1    1   1
# 10 0    0   0
# 11 0    0   0
# 12 1    2   2
# 13 1    2   2
# 14 1    2   2
# 15 1    2   2
# 16 0    0   0
# 17 0    0   0
# 18 0    0   0
# 19 0    0   0
# 20 0    0   0
# 21 0    0   0
# 22 0    0   0
# 23 0    0   0
# 24 0    0   0
# 25 0    0   0
# 26 1    3   3
# 27 1    3   3
# 28 1    3   3

The reason I chose to use - (0 %in% x$out) instead of a hard-coded 1 is that I wanted to guard against the possibility of there being no 0s in the data. Put differently, (0 %in% x$out) resolves to FALSE or TRUE, which, when subtracted from integers, is coerced to 0L or 1L, respectively. The reason I need this: if there is a 0 in $out, then the lookup for those rows is effectively match(0, c(0, 2, 4, 6)), which returns 1. We want the x == 0 rows to be 0L, so we have to subtract one. Since the second argument (from sort(unique(.))) always starts with 0 when zeros are present (as here) and otherwise contains no 0 at all (no zeroes present in x$x), it's an easy adjustment.
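A small check of that guard, as a sketch: renumber() is a hypothetical helper that just wraps the steps above for a bare vector, and the two inputs are made up to cover the "has zeros" and "no zeros" cases.

```r
# Base-R run id, as defined earlier in the answer
my_rleid <- function(x) {
  yy <- rle(x)
  rep(seq_along(yy$lengths), yy$lengths)
}

# Hypothetical wrapper around the steps above (name is illustrative)
renumber <- function(x) {
  out <- ifelse(x == 0, 0L, my_rleid(x))        # zero out the runs of 0s
  match(out, sort(unique(out))) - (0 %in% out)  # close gaps; shift only if a 0 exists
}

with_zeros <- renumber(c(0, 1, 1, 0, 1))  # 0 1 1 0 2
no_zeros   <- renumber(c(1, 1, 1))        # 1 1 1
```

With a hard-coded - 1, the second call would return 0 0 0 instead, which is the case the guard protects against.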

If you are certain that the data will always contain at least one 0, and you don't like the - (.) I appended to match(.), then you can change that to match(.) - 1L.
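As an aside, here is a different technique from the rle approach above: if the column is guaranteed to contain only 0s and 1s (an assumption), the consecutive numbering can be sketched in one step by counting where each run of 1s starts. The names starts and alt are illustrative.

```r
# Same data as in the question
x <- c(rep(0, 4), rep(1, 5), rep(0, 2), rep(1, 4), rep(0, 10), rep(1, 3))

# TRUE wherever the value steps up from 0 to 1, i.e. a run of 1s begins;
# the leading 0 makes a run starting at position 1 count as well
starts <- diff(c(0, x)) == 1

# Running count of runs started so far, zeroed wherever x is 0
alt <- cumsum(starts) * x

goal <- c(rep(0, 4), rep(1, 5), rep(0, 2), rep(2, 4), rep(0, 10), rep(3, 3))
all(alt == goal)  # TRUE
```

This avoids the renumbering step entirely, but unlike rle it relies on the values being exactly 0 and 1.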