I am working with the HealthIns dataset from the 'pglm' package in R. I would like to drop every individual whose number of observations differs from 5 (some of them are observed for only three years). In other words, I want to create a new data frame containing only the individuals for whom I have data for years 1, 2, 3, 4 and 5. Any suggestions on how I can do this? Thank you in advance.
First, let's find out which ids have data for all five years:
# Load library
library(tidyverse)
complete <- HealthIns %>%
  group_by(id) %>%
  count() %>%
  ungroup() %>%
  filter(n == 5) %>%
  pull(id)
Now we can use it to filter the data:
df <- HealthIns %>%
  filter(id %in% complete)
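The two steps above can also be collapsed into a single grouped filter. Here is a minimal sketch on a small made-up data frame (`toy` is hypothetical and only stands in for HealthIns; it is assumed to have the same `id` and `year` columns):

```r
library(dplyr)

# Hypothetical stand-in for HealthIns:
# ids 1 and 3 have five rows each, id 2 has only three
toy <- data.frame(
  id   = c(rep(1, 5), rep(2, 3), rep(3, 5)),
  year = c(1:5, 1:3, 1:5)
)

# Keep only the rows belonging to ids that are observed exactly five times
toy_complete <- toy %>%
  group_by(id) %>%
  filter(n() == 5) %>%
  ungroup()
```

Replacing `toy` with `HealthIns` should give the same rows as the two-step version, without the intermediate vector of ids.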
Let's check that df is correct:
df %>%
  group_by(year) %>%
  count()
# A tibble: 5 x 2
# Groups:   year [5]
   year     n
  <dbl> <int>
1     1  1584
2     2  1584
3     3  1584
4     4  1584
5     5  1584
As you can see, df has the same number of observations (1584) for each value of year.
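Note that counting rows per id assumes each id appears at most once per year. If that cannot be taken for granted, a stricter filter can also require that years 1 to 5 are all present. This is a sketch on hypothetical data (`toy` is made up for illustration; the column names `id` and `year` are assumed to match HealthIns):

```r
library(dplyr)

# Hypothetical data: id 2 has five rows, but year 3 is duplicated and year 5 is missing
toy <- data.frame(
  id   = c(rep(1, 5), rep(2, 5)),
  year = c(1:5, 1, 2, 3, 3, 4)
)

# Keep an id only if it has exactly five rows AND covers every year from 1 to 5
strict <- toy %>%
  group_by(id) %>%
  filter(n() == 5, all(1:5 %in% year)) %>%
  ungroup()
```

Here a plain row count would keep id 2, while the stricter condition drops it.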