Search code examples
rdataframegroup-byrun-length-encoding

Grouping of R dataframe by connected values


I didn't find a solution for this common grouping problem in R:

This is my original dataset

ID  State
1   A
2   A
3   B
4   B
5   B
6   A
7   A
8   A
9   C
10  C

This should be my grouped resulting dataset

State   min(ID) max(ID)
A       1       2
B       3       5
A       6       8
C       9       10

So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.

Any ideas?


Solution

  • You could try:

    library(dplyr)
    df %>%
      mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
      group_by(rleid) %>%
      summarise(State = first(State), min = min(ID), max = max(ID)) %>%
      select(-rleid)
    

    Or as per mentioned by @alistaire in the comments, you can actually mutate within group_by() with the same syntax, combining the first two steps. Stealing data.table::rleid() and using summarise_all() to simplify:

    df %>% 
      group_by(State, rleid = data.table::rleid(State)) %>% 
      summarise_all(funs(min, max)) %>% 
      select(-rleid)
    

    Which gives:

    ## A tibble: 4 × 3
    #   State   min   max
    #  <fctr> <int> <int>
    #1      A     1     2
    #2      B     3     5
    #3      A     6     8
    #4      C     9    10