Search code examples
rdata.tablediffcumsum

using diff and cumsum in R


I have been trying to use a solution provided in this question, however it is not working for some reason.

Basically, i want to find all IDs that have at least 4 consecutive years and only keep rows that belong to those IDs.

Any ideas?

# example data
ID <- c(rep("A", 5), rep("B", 6), rep("C", 2), rep("D", 3), rep("E", 4))
year <- as.numeric(c(rep(c(2012, 2013, 2014, 2015), 4), 2012, 2013, 2015, 2016))
mydata <- cbind(ID, year)
mydata <- as.data.table(mydata)
mydata$year <- as.numeric(mydata$year)

# provided solution
mydata2 <- setDT(mydata)[, grp := cumsum(c(0, diff(year)) > 1), by = ID
                         ][, if (.N > 4) .SD, by = grp][, grp := NULL][]

Solution

  • Here's how you could do that with data.table. Basically, you want to make sure to have the table sorted first. Second, I'm using diff(year)==1. We want consecutive years, so it's important that the diff is exactly equal to 1 to prevent jumps in years. Finally, I'm only keeping IDs that have a max(cumsum) of 3 or more (3 not 4 since we are starting to count from 0).

    setorder(mydata)
    mydata2 <- mydata[, grp := cumsum(c(0, diff(year)==1)), by = ID]
    mydata2[,max_grp := max(grp), by=ID]
    mydata2[max_grp>=3]
    
        ID year grp max_grp
     1:  A 2012   0       3
     2:  A 2012   0       3
     3:  A 2013   1       3
     4:  A 2014   2       3
     5:  A 2015   3       3
     6:  B 2012   0       3
     7:  B 2013   1       3
     8:  B 2013   1       3
     9:  B 2014   2       3
    10:  B 2014   2       3
    11:  B 2015   3       3