I have a dataframe with a column consisting of sequences of 0s and 1s. The 0s are not of interest, but the 1s signify events occurring in a time series, and the goal is to assign a unique value to each event. Simple integer values suffice. In the code below, 'x' is what I have and 'goal' is what I am after.
This seems so simple yet I don't quite know how to phrase the question on a help search...
What I have as a dataframe:
x <- c(rep(0,4),rep(1,5),rep(0,2),rep(1,4),rep(0,10),rep(1,3))
x <- data.frame(x)
What I want in the dataframe:
x$goal <- c(rep(0,4),rep(1,5),rep(0,2),rep(2,4),rep(0,10),rep(3,3))
This is effectively a run-length encoding with a slight twist: the runs of 0s are set back to 0.
While data.table::rleid does this well, if you are not already using that package, a base-R equivalent is:
my_rleid <- function(x) { yy <- rle(x); rep(seq_along(yy$lengths), yy$lengths); }
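To make the helper's behaviour concrete, here is a small sketch on a made-up vector (v is purely illustrative and not part of the question's data). It shows that every run, whether of 0s or 1s, gets its own increasing id:

```r
# Same helper as above, repeated so this snippet runs on its own
my_rleid <- function(x) { yy <- rle(x); rep(seq_along(yy$lengths), yy$lengths) }

# v is a hypothetical toy input: runs are (0,0), (1,1,1), (0), (1)
v <- c(0, 0, 1, 1, 1, 0, 1)

rle(v)$lengths   # 2 3 1 1
my_rleid(v)      # 1 1 2 2 2 3 4
```

Because the runs of 0s consume ids too, the runs of 1s end up with non-consecutive ids.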
Applying it, and then resetting the rows where x is 0, we'll see
x$out <- my_rleid(x$x)
x$out <- ifelse(x$x == 0, 0L, x$out)
x
# x goal out
# 1 0 0 0
# 2 0 0 0
# 3 0 0 0
# 4 0 0 0
# 5 1 1 2
# 6 1 1 2
# 7 1 1 2
# 8 1 1 2
# 9 1 1 2
# 10 0 0 0
# 11 0 0 0
# 12 1 2 4
# 13 1 2 4
# 14 1 2 4
# 15 1 2 4
# 16 0 0 0
# 17 0 0 0
# 18 0 0 0
# 19 0 0 0
# 20 0 0 0
# 21 0 0 0
# 22 0 0 0
# 23 0 0 0
# 24 0 0 0
# 25 0 0 0
# 26 1 3 6
# 27 1 3 6
# 28 1 3 6
which is pretty close: each event has its own value, but the numbering skips (2, 4, 6). If you need consecutive numbers with no gaps, then
x$out <- match(x$out, sort(unique(x$out))) - (0 %in% x$out)
x
# x goal out
# 1 0 0 0
# 2 0 0 0
# 3 0 0 0
# 4 0 0 0
# 5 1 1 1
# 6 1 1 1
# 7 1 1 1
# 8 1 1 1
# 9 1 1 1
# 10 0 0 0
# 11 0 0 0
# 12 1 2 2
# 13 1 2 2
# 14 1 2 2
# 15 1 2 2
# 16 0 0 0
# 17 0 0 0
# 18 0 0 0
# 19 0 0 0
# 20 0 0 0
# 21 0 0 0
# 22 0 0 0
# 23 0 0 0
# 24 0 0 0
# 25 0 0 0
# 26 1 3 3
# 27 1 3 3
# 28 1 3 3
The reason I chose to use - (0 %in% x$out) instead of a hard-coded 1 is that I wanted to guard against the possibility of there being no 0s in the data. Put differently, (0 %in% x$out) resolves to FALSE or TRUE, which, when subtracted from integers, is coerced to 0L or 1L, respectively. The reason I need this: if there is a 0 in $out, then the lookup for those rows is effectively match(0, c(0, 2, 4, 6)), which returns 1. We want the x == 0 matches to be 0L, so we have to subtract one. Since the second argument (from sort(unique(.))) always either starts with 0 (as here) or contains no 0 at all (no zeroes present in x$x), it's an easy adjustment.
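A quick sketch of both cases, using two made-up vectors (with_zero and no_zero are illustrative names, not from the question), may make the coercion easier to see:

```r
# Hypothetical intermediate results, standing in for x$out
with_zero <- c(0L, 2L, 2L, 0L, 4L)   # zeros present
no_zero   <- c(1L, 1L, 1L)           # all-ones input: a single run, no zeros

# TRUE is coerced to 1L, so the 0 rows land on 0 and events become 1, 2
res_with <- match(with_zero, sort(unique(with_zero))) - (0 %in% with_zero)
res_with   # 0 1 1 0 2

# FALSE is coerced to 0L, so nothing is subtracted and the event stays 1
res_none <- match(no_zero, sort(unique(no_zero))) - (0 %in% no_zero)
res_none   # 1 1 1
```

With a hard-coded - 1, the second case would come out as all 0s, which would wrongly mark the event as a non-event.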
If you are certain that the data will always contain 0s, and you don't like the - (.) I appended to match(.), then you can change that to match(.) - 1L.
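For convenience, the steps above can be folded into one function. This is only a sketch: label_events is a hypothetical name, not something from base R or data.table, and it assumes x holds only 0s and 1s with no NA values.

```r
# label_events is a hypothetical helper name; assumes x contains only 0/1, no NA
label_events <- function(x) {
  r  <- rle(x)
  id <- rep(seq_along(r$lengths), r$lengths)   # one id per run (0-runs included)
  id[x == 0] <- 0L                             # reset the uninteresting runs to 0
  match(id, sort(unique(id))) - (0 %in% id)    # renumber the events without gaps
}

# The question's data and desired result
x    <- c(rep(0, 4), rep(1, 5), rep(0, 2), rep(1, 4), rep(0, 10), rep(1, 3))
goal <- c(rep(0, 4), rep(1, 5), rep(0, 2), rep(2, 4), rep(0, 10), rep(3, 3))

all(label_events(x) == goal)   # TRUE
```

It also behaves sensibly when the series starts with an event, e.g. label_events(c(1, 1, 0, 0, 1)) gives 1 1 0 0 2.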