Search code examples
rdplyrdata.tablerankrun-length-encoding

Create counter for runs of TRUE among FALSE and NA, by group


I have a little nut to crack.

I have a data.frame where runs of TRUE are separated by runs of one or more FALSE or NA:

   group criterium
1      A        NA
2      A      TRUE
3      A      TRUE
4      A      TRUE
5      A     FALSE
6      A     FALSE
7      A      TRUE
8      A      TRUE
9      A     FALSE
10     A      TRUE
11     A      TRUE
12     A      TRUE
13     B        NA
14     B     FALSE
15     B      TRUE
16     B      TRUE
17     B      TRUE
18     B     FALSE

structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", 
"B"), class = "factor"), criterium = c(NA, TRUE, TRUE, TRUE, 
FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, NA, FALSE, 
TRUE, TRUE, TRUE, FALSE)), class = "data.frame", row.names = c(NA, 
-18L))

I want to rank the groups of TRUE in column criterium in ascending order while disregarding the FALSEand NA. The goal is to have a unique, consecutive ID for each run of TRUE, within each group.

So the result should look like:

    group criterium goal
1      A        NA   NA
2      A      TRUE    1
3      A      TRUE    1
4      A      TRUE    1
5      A     FALSE   NA
6      A     FALSE   NA
7      A      TRUE    2
8      A      TRUE    2
9      A     FALSE   NA
10     A      TRUE    3
11     A      TRUE    3
12     A      TRUE    3
13     B        NA   NA
14     B     FALSE   NA
15     B      TRUE    1
16     B      TRUE    1
17     B      TRUE    1
18     B     FALSE   NA

I'm sure there is a relatively easy way to do this, I just can't think of one. I experimented with dense_rank() and other window functions of dplyr, but to no avail.


Solution

  • Another data.table approach:

    library(data.table)
    setDT(dt)
    dt[, cr := rleid(criterium)][
        (criterium), goal := rleid(cr), by=.(group)]