Search code examples
rnamissing-datanearest-neighbor

Replace NA with the nearest value based on another variable, while keeping NA for observation which doesn't have non-missing neighbour


Here I have a data that looks like this:

year <- c(2000,2001,2002,2003,2005,2006,2007,2008,2009,2010)
x <- c(1,2,3,NA,5,NA,NA,NA,9,10)
dat <- data.frame(year, x)
  1. I want to replace NA with the nearest neighbor according to the year variable.

For example, The fourth place of the data (the first NA) takes the value from its left neighbor rather than its right neighbor because its year "2003" is closer to "2002" instead of "2005"

  1. I want to leave the NA there when it does not have nearest nonNA neighbor.

For example, the seventh place of the data (the third NA) will still be NA because it does not have non-NA neighbor.

After imputing, the resulting x should be 1, 2, 3, 3, 5, 5, NA, 9, 9, 10


Solution

  • One option would be to make use of case_when from tidyverse. Essentially, if the previous row has a closer year and is not NA, then return x from that row. If not, then choose the row below. Or if the year is closer above but there is an NA, then return the row below. Then, same for if the row below has a closer year, but has an NA, then return the row above. If a row does not have an NA, then just return x.

    library(tidyverse)
    
    dat %>%
      mutate(x = case_when(is.na(x) & !is.na(lag(x)) & year - lag(year) < lead(year) - year ~ lag(x),
                           is.na(x) & !is.na(lead(x)) & year - lag(year) > lead(year) - year ~ lead(x),
                           is.na(x) & is.na(lag(x)) ~ lead(x),
                           is.na(x) & is.na(lead(x)) ~ lag(x),
                           TRUE ~ x))
    

    Output

       year  x
    1  2000  1
    2  2001  2
    3  2002  3
    4  2003  3
    5  2005  5
    6  2006  5
    7  2007 NA
    8  2008  9
    9  2009  9
    10 2010 10