I have a data table containing thousands of firms that can be identified by a unique ID. It is long format data and each firm is supposed to appear twice in different years (cross-sectional time-series over two years).
However, not all firms appear in both years and I am trying to create a balanced long format panel in which only firms remain that appear in both years. How do I accomplish this?
This is an example data table to illustrate the issue:
example <- matrix(c(1,1,2,3,3,2013,2016,2013,2013,2016), ncol=2)
colnames(example) <- c('id', 'year')
example.table <- data.table(example)
example.table
id year
1: 1 2013
2: 1 2016
3: 2 2013
4: 3 2013
5: 3 2016
In the example, I need a code/function that lets me exclude the row of the firm with the id "2", because it has no match in 2016. In other words: I need a code/function that compares each row with the previous & subsequent row and excludes it, if there is no match in the id column.
I have invested many hours, but appear to have reached the limits of my R knowledge and would appreciate any support. Thanks!
Using dplyr
as below:
library(dplyr)
example.table %>%
group_by(id) %>%
filter(n() > 1)
# A tibble: 4 x 2
# Groups: id [2]
id year
<dbl> <dbl>
1 1 2013
2 1 2016
3 3 2013
4 3 2016