I am working in R with a longitudinal database (register data, so pretty large) about individuals, with several rows per ID (named "vn" in the database) and their attributes in columns. My variable "observation" indicates each year of observation. Sometimes (but not in all cases) one or more years are skipped because nothing changes for the individual. I would like to add those "missing rows" to my database so that each individual has an entry for every year between their first and last observation (which aren't necessarily the same for everyone). Since an individual's attributes can change over time, each added row must carry the same attribute values as the previous row (in the example below, if a row is added for 2010, the individual will have a value of 3 in maritalstatus and 5584 in municipality).
Here is an overview of an individual in my database:
structure(list(vn = c("555", "555", "555", "555", "555", "555", "555", "555", "555", "555", "555"), municipality = c(5586, 5586, 5586, 5586, 5586, 5586, 5611, 5611, 5584, 5584, 5584), yearofbirth = c(1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957), sex = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), maritalstatus = c(2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), observation = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2012)), row.names = 470:480, class = "data.frame")
My current code (below) adds rows to my dataset when an observation year is missing and keeps all the information from the previous row, but some information is now duplicated: some observation years appear twice.
test<-test %>% expand(vn, municipality, yearofbirth, sex, maritalstatus, full_seq(observation,1))
I was also thinking of using rep() but can't find a way to do what I want.
Does anyone have an idea for code that could help me?
If we have at least one observation per year, then this could be achieved via tidyr::complete and tidyr::fill like so:
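For the basic case, a minimal sketch on hypothetical toy data may help. The `toy` data frame and its values are made up for illustration, and grouping by vn before the fill is an assumption added here so that values are not carried over from one individual to the next:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: every year from 2000 to 2003 is observed
# for at least one individual, but each individual skips one year
toy <- tibble(
  vn = c("A", "A", "A", "B", "B", "B"),
  observation = c(2000, 2002, 2003, 2000, 2001, 2003),
  maritalstatus = c(1, 2, 2, 3, 3, 4)
)

filled <- toy %>%
  # One row per combination of vn and observed year
  complete(vn, observation) %>%
  # Carry the last known value forward within each individual
  group_by(vn) %>%
  fill(maritalstatus) %>%
  ungroup()
```

Here `filled` has one row per individual and year, and the skipped years take the value of the preceding row.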
Edit 1: If not all years are present in the dataset, the approach still works by first converting observation to a factor with the levels set to the range of years:
Edit 2: To account for year ranges that differ between individuals, one has to filter after the fill. To this end I added a helper variable last_obs containing the last year observed for each individual and use it to drop the rows that were filled in beyond that year.
<!-- language-all: lang-r -->
d <- structure(list(vn = c("555", "555", "555", "555", "555", "555", "555", "555", "555", "555", "555"), municipality = c(5586, 5586, 5586, 5586, 5586, 5586, 5611, 5611, 5584, 5584, 5584), yearofbirth = c(1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957), sex = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), maritalstatus = c(2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), observation = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2012)), row.names = 470:480, class = "data.frame")
library(dplyr)
library(tidyr)
# Add a second individual (vn 556), keeping only even years up to 2010
d1 <- d %>%
  mutate(vn = "556") %>%
  filter(observation <= 2010, observation %% 2 == 0)
# Bind data
d2 <- bind_rows(d, d1)
d2 %>%
  # Add year of last obs by vn
  group_by(vn) %>%
  mutate(last_obs = last(observation)) %>%
  ungroup() %>%
  # Convert to fct so that complete() also creates years missing from the data
  mutate(observation = factor(observation, levels = 2000:2016)) %>%
  # Complete, then fill within each vn so values never carry over between individuals
  tidyr::complete(vn, observation) %>%
  group_by(vn) %>%
  tidyr::fill(everything()) %>%
  ungroup() %>%
  # Convert back to integer
  mutate(observation = as.integer(as.character(observation))) %>%
  # Drop obs after year of last obs
  filter(observation <= last_obs) %>%
  # Drop helper
  select(-last_obs)
#> # A tibble: 22 x 6
#> vn observation municipality yearofbirth sex maritalstatus
#> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 555 2000 5586 1957 2 2
#> 2 555 2001 5586 1957 2 2
#> 3 555 2002 5586 1957 2 3
#> 4 555 2003 5586 1957 2 3
#> 5 555 2004 5586 1957 2 3
#> 6 555 2005 5586 1957 2 3
#> 7 555 2006 5611 1957 2 3
#> 8 555 2007 5611 1957 2 3
#> 9 555 2008 5584 1957 2 3
#> 10 555 2009 5584 1957 2 3
#> # ... with 12 more rows
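As a possible alternative to the factor conversion and the last_obs helper, complete() can also be applied per individual together with tidyr::full_seq(), so that each vn only gets the years between its own first and last observation. This is a sketch under assumptions rather than part of the code above: the `toy` data frame and its values are hypothetical, and only one attribute column is filled for brevity.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two individuals with different year ranges and gaps
toy <- tibble(
  vn = c("A", "A", "A", "B", "B"),
  observation = c(2000, 2001, 2004, 2002, 2005),
  municipality = c(10, 10, 20, 30, 40)
)

filled <- toy %>%
  group_by(vn) %>%
  # full_seq() is evaluated per group, so it spans each
  # individual's own first to last observed year
  complete(observation = full_seq(observation, 1)) %>%
  # Carry the previous row's value into the newly created rows
  fill(municipality) %>%
  ungroup()
```

Because the sequence is built within each group, no rows are created before an individual's first year or after their last one, so no filtering step is needed afterwards.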