
Collapse rows with consecutive TRUE values into a single row in R


Consider the following dataframe.

df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7),
                 Value = c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE))

I want to transform df into df' such that every run of consecutive TRUE values in Value is collapsed into a single row, whose ID combines the ID of the first TRUE and the ID of the last TRUE in the run. For the data frame above, we should get

ID    |   Value
---------------
1         FALSE
(2, 3)    TRUE
4         FALSE
(5, 7)    TRUE

I have no particular interest in the resulting type of ID (factor, string, whatever). How can I achieve this? I have tried mutating the data frame using dplyr's mutate function, but I could not find the correct function to use inside mutate. I am aware that if I were able to assign a different group to each sequence of TRUE values, in other words, if I could apply the intermediate transformation from

ID    |   Value
---------------
1         FALSE
2         TRUE
3         TRUE
4         FALSE
5         TRUE
6         TRUE
7         TRUE

to

ID   |   Value    | Group
------------------------
1         FALSE     0
2         TRUE      1
3         TRUE      1
4         FALSE     0
5         TRUE      2
6         TRUE      2
7         TRUE      2

the task would be much easier. But I haven't found a way to group rows by identifying sequences of values. Any help is appreciated. Thanks.


Solution

  • I would calculate the groups by ticking a counter up whenever the value is FALSE or the lagged value is FALSE.

    library(tidyverse)
    
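    df |>
      # Start a new group at every FALSE row and at every row that follows a FALSE
      mutate(grp = cumsum(!Value | !lag(Value, default = TRUE))) |>
      group_by(grp) |>
      # Single-row groups keep their ID; longer runs become "(first, last)"
      summarise(ID = ifelse(n() == 1, as.character(first(ID)),
                            paste0("(", first(ID), ", ", last(ID), ")")),
                Value = first(Value))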
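
If you want the exact Group column described in the question (0 for every FALSE row, a running counter for each run of TRUEs), a minimal base R sketch could look like the following. It is not part of the pipeline above, and the names prev and starts are only illustrative.

```r
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7),
                 Value = c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE))

# Previous row's Value; the first row has no predecessor, so treat it as FALSE
prev <- c(FALSE, head(df$Value, -1))

# TRUE only on the first row of each run of TRUEs
starts <- df$Value & !prev

# Count the run starts seen so far, but label every FALSE row 0
df$Group <- ifelse(df$Value, cumsum(starts), 0L)
# Group: 0 1 1 0 2 2 2
```

Unlike grp in the answer, this labelling puts all FALSE rows in group 0, so grouping by it alone would merge them; it is mainly useful for picking out the TRUE runs.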
    

    EDIT: Here is a step-by-step walkthrough of the logic.

    First, we evaluate !Value:

    df |> mutate(grp = !Value)
    #>   ID Value   grp
    #> 1  1 FALSE  TRUE
    #> 2  2  TRUE FALSE
    #> 3  3  TRUE FALSE
    #> 4  4 FALSE  TRUE
    #> 5  5  TRUE FALSE
    #> 6  6  TRUE FALSE
    #> 7  7  TRUE FALSE
    

    To make this easier to read (logicals can be hard to scan in this form), we can wrap the expression in `+`() to convert it to 0/1 integers.

    df |> mutate(grp = `+`(!Value))
    #>   ID Value grp
    #> 1  1 FALSE   1
    #> 2  2  TRUE   0
    #> 3  3  TRUE   0
    #> 4  4 FALSE   1
    #> 5  5  TRUE   0
    #> 6  6  TRUE   0
    #> 7  7  TRUE   0
    

    From this we can see that every row with Value = FALSE gets a 1 and every row with Value = TRUE gets a 0. We can then take the cumulative sum (i.e., cumsum) so that the counter increases every time we encounter a 1 (i.e., a Value = FALSE).

    df |> mutate(grp = cumsum(`+`(!Value)))
    #>   ID Value grp
    #> 1  1 FALSE   1
    #> 2  2  TRUE   1
    #> 3  3  TRUE   1
    #> 4  4 FALSE   2
    #> 5  5  TRUE   2
    #> 6  6  TRUE   2
    #> 7  7  TRUE   2
    

    Here we see that every time Value = FALSE, grp goes up by one. We can also see that this logic does not yet capture the grouping correctly: each run of Value = TRUE gets lumped in with the preceding Value = FALSE row, but we want the run in its own group. We can add a second condition to the expression to take care of that: if the previous value was FALSE, we also tick the group up by one.

    df |> mutate(grp = cumsum(`+`(!Value | !lag(Value, default = TRUE))))
    #>   ID Value grp
    #> 1  1 FALSE   1
    #> 2  2  TRUE   2
    #> 3  3  TRUE   2
    #> 4  4 FALSE   3
    #> 5  5  TRUE   4
    #> 6  6  TRUE   4
    #> 7  7  TRUE   4
    

    Notice that the groups are now exactly as we want them: every FALSE row is its own group, and each run of consecutive TRUEs is its own group. Hopefully this helps.
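
For readers who would rather avoid the tidyverse dependency, the same grouping logic can be sketched in base R. This is an illustrative variant under the same assumptions as the walkthrough, not the answer's own code; collapse_ids is a hypothetical helper name.

```r
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7),
                 Value = c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE))

# Lagged Value with TRUE as the default for the first row
prev <- c(TRUE, head(df$Value, -1))

# Tick the counter on every FALSE row and on every row that follows a FALSE
grp <- cumsum(!df$Value | !prev)
# grp: 1 2 2 3 4 4 4

# Hypothetical helper: a single ID stays as is, a run becomes "(first, last)"
collapse_ids <- function(ids) {
  if (length(ids) == 1) {
    as.character(ids)
  } else {
    paste0("(", ids[1], ", ", ids[length(ids)], ")")
  }
}

res <- data.frame(
  ID = vapply(split(df$ID, grp), collapse_ids, character(1), USE.NAMES = FALSE),
  Value = df$Value[!duplicated(grp)],  # first Value of each group
  stringsAsFactors = FALSE
)
# res$ID: "1" "(2, 3)" "4" "(5, 7)"
```

Because grp never decreases, split() returns the groups in row order, so the collapsed IDs line up with the first Value of each group.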