I am trying to solve a problem in R which in SQL is often called gaps and islands. Below is a mockup of the data I am using:
library(tidyverse)
library(zoo)
df <- tibble('id' = c('A','A','A', 'B', 'B'),
'year' = as.yearqtr(c("01/01/2001","01/04/2001","01/10/2002", "01/09/2001","01/12/2001"),"%d/%m/%Y"),
'var1' = c(1,3,2,5,3),
'var2' = c(10,20,5,8,9))
ie:
id year var1 var2
<chr> <yearqtr> <dbl> <dbl>
1 A 2001 Q1 1 10
2 A 2001 Q2 3 20
3 A 2002 Q4 2 5
4 B 2001 Q3 5 8
5 B 2001 Q4 3 9
I want to group the observations by id, and then add a tag to distinguish individual sequential blocks of data. In my Case, sequential here means quarter to quarter. But in another file I am using, it is year to year. So the expected output would be
id year var1 var2 group
<chr> <yearqtr> <dbl> <dbl> <dbl>
1 A 2001 Q1 1 10 1
2 A 2001 Q2 3 20 1
3 A 2002 Q4 2 5 2
4 B 2001 Q3 5 8 1
5 B 2001 Q4 3 9 1
Because the observations in lines 1 and 2, belong to group 1 for ID A, line 3 belongs to group 2 for ID A, and lines 4 and 5 are both part of group 1 for ID B
I've looked at the zoo package but I can't find an answer, and most of the other SO posts seem to be about SQL, so I'm stuck
EDIT With the help of the answer below, I found this was the correct solution for me:
df %>%group_by(id) %>%
mutate(group = cumsum(c(TRUE, 4* diff(year) > 1)))
I think this is simply cumsum(diff(year) > 1)
per id
:
library(dplyr)
df %>%
mutate(group = cumsum(c(TRUE, diff(year) > 1)), .by = id)
# # A tibble: 5 × 5
# id year var1 var2 group
# <chr> <yearqtr> <dbl> <dbl> <int>
# 1 A 2001 Q1 1 10 1
# 2 A 2001 Q2 3 20 1
# 3 A 2002 Q4 2 5 2
# 4 B 2001 Q3 5 8 1
# 5 B 2001 Q4 3 9 1
Base alternative:
df$group <- ave(as.numeric(df$year), df$id,
FUN = function(z) cumsum(c(TRUE, diff(z) > 1)))
where as.numeric(.)
is needed because since ave
will retain the class of the differenced column ("yearqtr"
), which in this case is not that useful.