Tags: r, date, sequence, rows, survival-analysis

R: Is there a way to insert rows that complete a year sequence?


I am working in R with a longitudinal database (register data, so pretty large) about individuals, with several rows per ID (named "vn" in the database) and their attributes in columns. My variable "observation" indicates each year of observation. Sometimes (but not in all cases) one or more years are skipped because nothing changed for the individual. I would like to add those "missing rows" to my database so that each individual has an entry for every year between their first and last observation (which aren't necessarily the same for everyone). Since an individual's attributes can change over time, each added row must carry the same attribute values as the previous row (in the example below, if a row is added for 2010, the individual will have a value of 3 in maritalstatus and 5584 in municipality).

Here is an overview of an individual in my database:

structure(list(vn = c("555", "555", "555", "555", "555", "555", "555", "555", "555", "555", "555"), municipality = c(5586, 5586, 5586, 5586, 5586, 5586, 5611, 5611, 5584, 5584, 5584), yearofbirth = c(1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957), sex = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), maritalstatus = c(2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), observation = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2012)), row.names = 470:480, class = "data.frame")

My current code (below) adds rows to my dataset when an observation year is missing and keeps the information from the previous row, but it also duplicates information: some observation years now appear twice.

test<-test %>% expand(vn, municipality, yearofbirth, sex, maritalstatus, full_seq(observation,1))

I was also thinking of using rep() but can't find a way to do what I want.

Does anyone have an idea for code that could do this?


Solution

  • If every year occurs at least once somewhere in the dataset, this can be achieved with tidyr::complete and tidyr::fill, as in the code below.

    Edit 1: If some years never appear in the dataset, the approach still works after first converting observation to a factor whose levels span the full range of years.

    Edit 2: To account for individuals with different year ranges, one has to filter after the fill. To this end I added a helper variable last_obs holding the last year observed for each individual, which is then used to drop the rows that fall after that year.

    d <- structure(list(vn = c("555", "555", "555", "555", "555", "555", "555", "555", "555", "555", "555"), municipality = c(5586, 5586, 5586, 5586, 5586, 5586, 5611, 5611, 5584, 5584, 5584), yearofbirth = c(1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957), sex = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), maritalstatus = c(2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), observation = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2012)), row.names = 470:480, class = "data.frame")
    
    library(dplyr)
    library(tidyr)
    
    # Add a second individual (vn 556), observed only in even years up to 2010
    d1 <- d %>%
      mutate(vn = "556") %>% 
      filter(observation <= 2010, observation %% 2 == 0)
    
    # Bind data
    d2 <- bind_rows(d, d1)
    
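    d2 %>%
      # Record each individual's last observed year (max() does not rely on row order)
      group_by(vn) %>%
      mutate(last_obs = max(observation)) %>%
      ungroup() %>%
      # Convert to factor so that complete() also creates years absent from the data
      mutate(observation = factor(observation, levels = 2000:2016)) %>%
      # Add a row for every vn/year combination
      tidyr::complete(vn, observation) %>%
      # Fill within each individual so values never carry over from the previous vn
      group_by(vn) %>%
      tidyr::fill(everything()) %>%
      ungroup() %>%
      # Convert back to integer
      mutate(observation = as.integer(as.character(observation))) %>%
      # Drop years after the last observation; rows before the first
      # observation still have last_obs = NA and are dropped as well
      filter(observation <= last_obs) %>%
      # Drop helper
      select(-last_obs)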
    #> # A tibble: 22 x 6
    #>    vn    observation municipality yearofbirth   sex maritalstatus
    #>    <chr>       <int>        <dbl>       <dbl> <dbl>         <dbl>
    #>  1 555          2000         5586        1957     2             2
    #>  2 555          2001         5586        1957     2             2
    #>  3 555          2002         5586        1957     2             3
    #>  4 555          2003         5586        1957     2             3
    #>  5 555          2004         5586        1957     2             3
    #>  6 555          2005         5586        1957     2             3
    #>  7 555          2006         5611        1957     2             3
    #>  8 555          2007         5611        1957     2             3
    #>  9 555          2008         5584        1957     2             3
    #> 10 555          2009         5584        1957     2             3
    #> # ... with 12 more rows
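
A possible alternative, sketched here on made-up toy data rather than the register data: grouping by vn before calling complete() with full_seq() lets each individual be expanded over their own first-to-last year range, so the factor conversion and the last_obs filter may not be needed. The data frame `toy` and its values are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)

# Toy data (illustrative values): two individuals with different
# observation windows and gaps between observed years.
toy <- tibble(
  vn            = c("555", "555", "555", "556", "556"),
  observation   = c(2008, 2009, 2012, 2003, 2006),
  municipality  = c(5584, 5584, 5584, 5611, 5586),
  maritalstatus = c(3, 3, 3, 1, 2)
)

filled <- toy %>%
  group_by(vn) %>%
  # Expand each individual to every year between their own first and last observation
  complete(observation = full_seq(observation, 1)) %>%
  # Carry the previous row's attributes forward within the individual
  fill(everything(), .direction = "down") %>%
  ungroup()

filled
```

Because both steps run per group, no values are carried over from one individual to the next, and no rows are created outside an individual's observed range.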