
How to drop the unbalanced groups from panel data in R


I am working with the HealthIns dataset from the 'pglm' package in R. I would like to drop every individual that does not have exactly 5 observations (some are observed for only three years), so that the new data frame contains only the individuals with data for years 1, 2, 3, 4 and 5. Any suggestions on how to do this? Thank you in advance.


Solution

  • First, let's find out which ids have data for all five years:

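    # Load packages and the data set
    library(tidyverse)
    data("HealthIns", package = "pglm")
    
    # ids that appear in exactly five rows (one row per year)
    complete <- HealthIns %>%
      group_by(id) %>%
      count() %>%
      ungroup() %>%
      filter(n == 5) %>%
      pull(id)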
    

    Now we can use it to filter the data:

    df <- HealthIns %>% 
      filter(id %in% complete)
    

    Let's check if df is correct:

    df %>% 
      group_by(year) %>% 
      count()
    
    # A tibble: 5 x 2
    # Groups:   year [5]
       year     n
      <dbl> <int>
    1     1  1584
    2     2  1584
    3     3  1584
    4     4  1584
    5     5  1584
    

    As you can see, df has the same number of observations (1584) for each value of year.
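
The two steps above can also be collapsed into a single grouped filter. The sketch below is one possible variant, not the only way to do it: it runs on a small made-up data frame (`toy`, with a hypothetical outcome column `y`) instead of HealthIns, and it assumes the panel should cover years 1 to 5. Besides counting rows, it also checks that each of the five years is actually present, which guards against an id that has five rows only because one year is duplicated.

```r
library(dplyr)

# Toy panel (made-up data): ids 1 and 3 are observed in all five years,
# id 2 is observed in only three
toy <- data.frame(
  id   = c(rep(1, 5), rep(2, 3), rep(3, 5)),
  year = c(1:5, 1:3, 1:5),
  y    = seq_len(13)
)

balanced <- toy %>%
  group_by(id) %>%
  # keep an id only if it has exactly five rows and every year 1..5 appears
  filter(n() == 5, all(1:5 %in% year)) %>%
  ungroup()

# Each remaining id should contribute exactly five rows
count(balanced, id)
```

Applied to the real data, the same pipeline would start from `HealthIns` instead of `toy`, assuming its `id` and `year` columns are as used in the answer above.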