Search code examples
rformattingpanel-datarowdeleting

How to delete rows with a unique ID from a panel data frame in R?


I have a data table containing thousands of firms that can be identified by a unique ID. It is long format data and each firm is supposed to appear twice in different years (cross-sectional time-series over two years).

However, not all firms appear in both years and I am trying to create a balanced long format panel in which only firms remain that appear in both years. How do I accomplish this?

This is an example data table to illustrate the issue:

example <- matrix(c(1,1,2,3,3,2013,2016,2013,2013,2016), ncol=2)
colnames(example) <- c('id', 'year')
example.table <- data.table(example)
example.table

   id year
1:  1 2013
2:  1 2016
3:  2 2013
4:  3 2013
5:  3 2016

In the example, I need a code/function that lets me exclude the row of the firm with the id "2", because it has no match in 2016. In other words: I need a code/function that compares each row with the previous & subsequent row and excludes it, if there is no match in the id column.

I have invested many hours, but appear to have reached the limits of my R knowledge and would appreciate any support. Thanks!


Solution

  • Using dplyr as below:

    library(dplyr)
    example.table %>%
      group_by(id) %>%
      filter(n() > 1)
    # A tibble: 4 x 2
    # Groups:   id [2]
         id  year
      <dbl> <dbl>
    1     1  2013
    2     1  2016
    3     3  2013
    4     3  2016