I have a panel dataset:
data <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
year = c(1,2,3,4,1,2,1,2,3),
score1 = c(90,78,92,69,86,73,82,85,91))
> data
ID year score1
1: 1 1 90
2: 1 2 78
3: 1 3 92
4: 1 4 69
5: 2 1 86
6: 2 2 73
7: 3 1 82
8: 3 2 85
9: 3 3 91
I want to place a constraint such that every ID
should have at least 3 years of observations. I tried using the following using the data.table
package:
data[data$year >= 3, ]
However this ignores the observations in years 1 & 2 for ID
's which have at least 3 years of observations. In other words, I want to only look at ID's which have at least 3 years of observations and also include the year 1 & 2 observations.
The expected output is thus:
ID year score1
1: 1 1 90
2: 1 2 78
3: 1 3 92
4: 1 4 69
5: 3 1 82
6: 3 2 85
7: 3 3 91
We may use a group by approach
library(dplyr)
data %>%
group_by(ID) %>%
filter(n_distinct(year) >= 3) %>%
ungroup
-output
# A tibble: 7 x 3
ID year score1
<dbl> <dbl> <dbl>
1 1 1 90
2 1 2 78
3 1 3 92
4 1 4 69
5 3 1 82
6 3 2 85
7 3 3 91
Or using data.table
library(data.table)
data <- setDT(data)[data[, .I[uniqueN(year) >=3], by = ID]$V1]
-output
data
ID year score1
1: 1 1 90
2: 1 2 78
3: 1 3 92
4: 1 4 69
5: 3 1 82
6: 3 2 85
7: 3 3 91