Search code examples
rmissing-dataimputation

Is there an R function for imputing missing year values, consecutively, by group?


My dataframe looks like:

df <- data.frame(ID=c("A", "A", "A", "A", 
                      "B", "B", "B", "B",
                      "C", "C", "C", "C",
                      "D", "D", "D", "D"),
                 grade=c("KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03"),
                 year=c(2002, 2003, NA, 2005,
                        2007, NA, NA, 2010,
                        NA, 2005, 2006, NA,
                        2009, 2010, NA, NA))

I would like to be able to impute the missing year values by ID, with the following desired results:

wanted_df <- data.frame(ID=c("A", "A", "A", "A", 
                             "B", "B", "B", "B",
                             "C", "C", "C", "C",
                             "D", "D", "D", "D"),
                       grade=c("KG", "01", "02", "03",
                               "KG", "01", "02", "03",
                               "KG", "01", "02", "03",
                               "KG", "01", "02", "03"),
                       year=c(2002, 2003, 2004, 2005,
                              2007, 2008, 2009, 2010,
                              2004, 2005, 2006, 2007,
                              2009, 2010, 2011, 2012))

I have attempted to impute the values using:

  • lag() and lead() functions
  • Joining to a dataframe consisting of years

Neither have worked. Any help would be greatly appreciated. Thank you.


Solution

  • We may use na_interpolate/na_extrapolate

    library(dplyr)
    # remotes::install_github("skgrange/threadr")
    library(threadr)
    df %>% 
       group_by(ID) %>% 
       mutate(year = na_extrapolate(na_interpolate(year))) %>%
       ungroup
    

    -output

    # A tibble: 16 × 3
       ID    grade  year
       <chr> <chr> <dbl>
     1 A     KG    2002 
     2 A     01    2003 
     3 A     02    2004 
     4 A     03    2005 
     5 B     KG    2007 
     6 B     01    2008 
     7 B     02    2009 
     8 B     03    2010 
     9 C     KG    2004.
    10 C     01    2005 
    11 C     02    2006 
    12 C     03    2007 
    13 D     KG    2009 
    14 D     01    2010 
    15 D     02    2011 
    16 D     03    2012.