Search code examples
rdatedplyrtidyrgaps-and-islands

Is there a way to split a group of dates by breaks in the sequence of dates, in R/dplyr? (A bit like Gaps and Islands)


I am trying to solve a problem in R which in SQL is often called gaps and islands. Below is a mockup of the data I am using:

library(tidyverse)
library(zoo)
df <- tibble('id' = c('A','A','A', 'B', 'B'), 
             'year' = as.yearqtr(c("01/01/2001","01/04/2001","01/10/2002", "01/09/2001","01/12/2001"),"%d/%m/%Y"), 
             'var1' = c(1,3,2,5,3), 
             'var2' = c(10,20,5,8,9))

ie:

  id    year       var1  var2
  <chr> <yearqtr> <dbl> <dbl>
1 A     2001 Q1       1    10
2 A     2001 Q2       3    20
3 A     2002 Q4       2     5
4 B     2001 Q3       5     8
5 B     2001 Q4       3     9

I want to group the observations by id, and then add a tag to distinguish individual sequential blocks of data. In my Case, sequential here means quarter to quarter. But in another file I am using, it is year to year. So the expected output would be

  id    year       var1  var2 group
  <chr> <yearqtr> <dbl> <dbl> <dbl>
1 A     2001 Q1       1    10     1
2 A     2001 Q2       3    20     1
3 A     2002 Q4       2     5     2
4 B     2001 Q3       5     8     1
5 B     2001 Q4       3     9     1

Because the observations in lines 1 and 2, belong to group 1 for ID A, line 3 belongs to group 2 for ID A, and lines 4 and 5 are both part of group 1 for ID B

I've looked at the zoo package but I can't find an answer, and most of the other SO posts seem to be about SQL, so I'm stuck

EDIT With the help of the answer below, I found this was the correct solution for me:

df %>%group_by(id) %>% 
  mutate(group = cumsum(c(TRUE, 4* diff(year) > 1)))

Solution

  • I think this is simply cumsum(diff(year) > 1) per id:

    library(dplyr)
    df %>%
      mutate(group = cumsum(c(TRUE, diff(year) > 1)), .by = id)
    # # A tibble: 5 × 5
    #   id    year       var1  var2 group
    #   <chr> <yearqtr> <dbl> <dbl> <int>
    # 1 A     2001 Q1       1    10     1
    # 2 A     2001 Q2       3    20     1
    # 3 A     2002 Q4       2     5     2
    # 4 B     2001 Q3       5     8     1
    # 5 B     2001 Q4       3     9     1
    

    Base alternative:

    df$group <- ave(as.numeric(df$year), df$id,
                    FUN = function(z) cumsum(c(TRUE, diff(z) > 1)))
    

    where as.numeric(.) is needed because since ave will retain the class of the differenced column ("yearqtr"), which in this case is not that useful.