Search code examples
rdata.tablepanel-datalongitudinal

Remove discontinuous time points in longitudinal data


I have a firm-year longitudinal data but the year is not continuous for some firms, for example

library(data.table)
dt = data.table(firm_id=c(rep(1,5),rep(2,5)),year=c(1990,1991,1999,2000,2001,1995,1997,2008,2009,2010))

For each firm, I want to keep observations in the most recent continuous years and remove other observations. For example, Firm 1 has five-year observations in (1990, 1991, 1999, 2000, 2001) and I want to keep (1999, 2000, 2001)

data.table(firm_id=c(rep(1,3),rep(2,3)),year=c(1999,2000,2001,2008,2009,2010))

I can think of some awkward approaches to solve this issue but I am wondering if there is an easy way to solve it.

Enlighted by the comments, I am also wondering if there is any way to keep the longest continuous vector block of years. For example,

library(data.table)
dt = data.table(firm_id=c(rep(1,5),rep(2,5)),year=c(1990,1991,1992,2000,2001,1995,1997,2008,2009,2010))

The result would be

enter image description here


Solution

  • library(data.table)
    DT2 <- setorder(dt, firm_id, year)[
     ,d := cumsum(c(TRUE, diff(year) > 1)), by = .(firm_id) ][
     ,n := .N, by = .(firm_id, d) ]
    
    DT2
    #     firm_id  year     d     n
    #       <num> <num> <int> <int>
    #  1:       1  1990     1     3
    #  2:       1  1991     1     3
    #  3:       1  1992     1     3
    #  4:       1  2000     2     2
    #  5:       1  2001     2     2
    #  6:       2  1995     1     1
    #  7:       2  1997     2     1
    #  8:       2  2008     3     3
    #  9:       2  2009     3     3
    # 10:       2  2010     3     3
    

    From here, if you want runs of 3 consecutive years or more, then

    DT2[ (n > 2), ]
    

    If you want the longest run for each firm_id, then

    DT2[, .SD[n == max(n),], by = .(firm_id) ]