Search code examples
rdata.tableconstraintspanel-data

Constraint on panel data to remove subjects using data.table


I have a panel dataset:

data <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
                   year = c(1,2,3,4,1,2,1,2,3),
                   score1 = c(90,78,92,69,86,73,82,85,91))

> data
   ID year score1
1:  1    1     90
2:  1    2     78
3:  1    3     92
4:  1    4     69
5:  2    1     86
6:  2    2     73
7:  3    1     82
8:  3    2     85
9:  3    3     91

I want to place a constraint such that every ID should have at least 3 years of observations. I tried using the following using the data.table package:

data[data$year >= 3, ]

However this ignores the observations in years 1 & 2 for ID's which have at least 3 years of observations. In other words, I want to only look at ID's which have at least 3 years of observations and also include the year 1 & 2 observations.

The expected output is thus:

   ID year score1
1:  1    1     90
2:  1    2     78
3:  1    3     92
4:  1    4     69
5:  3    1     82
6:  3    2     85
7:  3    3     91

Solution

  • We may use a group by approach

    library(dplyr)
    data %>%
        group_by(ID) %>% 
        filter(n_distinct(year) >= 3) %>%
        ungroup
    

    -output

    # A tibble: 7 x 3
         ID  year score1
      <dbl> <dbl>  <dbl>
    1     1     1     90
    2     1     2     78
    3     1     3     92
    4     1     4     69
    5     3     1     82
    6     3     2     85
    7     3     3     91
    

    Or using data.table

    library(data.table)
    data <- setDT(data)[data[, .I[uniqueN(year) >=3], by = ID]$V1]
    

    -output

    data
        ID year score1
    1:  1    1     90
    2:  1    2     78
    3:  1    3     92
    4:  1    4     69
    5:  3    1     82
    6:  3    2     85
    7:  3    3     91