Search code examples
rdplyrseq

R label sequence by changing factor values


I have the following data:

Account date type
1 2021-08-31 0
1 2021-09-23 0
1 2021-09-30 5
1 2021-10-30 0
1 2021-12-29 0
1 2022-01-31 8
1 2022-02-02 0

I need to find the minimum date of each individual transition.

group_by(Account, type) %>%
summarise(first_appearance = min(date))

returns

Account date type
1 2021-08-31 0
1 2021-09-30 5
1 2022-01-31 8

How can I group by each SHIFT in type?

My initial thoughts are to generate some sort of sequence along the factors and concatenate to have a unique grouping variable, but how would this be done?

Account date type order type_order
1 2021-08-31 0 A 0A
1 2021-09-23 0 A 0A
1 2021-09-30 5 A 5A
1 2021-10-30 0 B 0B
1 2021-12-29 0 B 0B
1 2022-01-31 8 A 8A
1 2022-02-02 0 C 0C

Desired output would be:

group_by(Account, type_order) %>%
summarise(first_appearance = min(date))
Account date type order type_order
1 2021-08-31 0 A 0A
1 2021-09-30 5 A 5A
1 2021-10-30 0 B 0B
1 2022-01-31 8 A 8A
1 2022-02-02 0 C 0C

Solution

  • Maybe use rleid from data.table to assign groups when there are differences in type from row to row.

    library(tidyverse)
    library(data.table)
    
    df %>%
      group_by(Account, grp = rleid(type), type, order) %>%
      summarise(first_appearance = min(date))
    

    Output

      Account   grp  type order first_appearance
        <int> <int> <int> <chr> <chr>           
    1       1     1     0 A     2021-08-31      
    2       1     2     5 A     2021-09-30      
    3       1     3     0 B     2021-10-30      
    4       1     4     8 A     2022-01-31      
    5       1     5     0 C     2022-02-02