Search code examples
rdplyrdata.table

update non-missing values based on most recent date


My data has multiple observations for each ID. At the ID level, I want to convert all values to the most recent non-missing value. I have tried using mutate, group_by(id) and which.max(year) unsuccessfully.

Data:

data <- data.frame(
  id=c(1,1,2,2,3,3,4,4,5,5),
  year=rep(c(2010, 2011), 5),
  employ=c("yes", "yes", "no", "yes", "yes", "no", NA, "yes", "no", NA))

> data
   id year employ
1   1 2010    yes
2   1 2011    yes
3   2 2010     no
4   2 2011    yes
5   3 2010    yes
6   3 2011     no
7   4 2010   <NA>
8   4 2011    yes
9   5 2010     no
10  5 2011   <NA>

Desired output:

data2 <- data.frame(
  id=c(1,1,2,2,3,3,4,4,5,5),
  year=c(2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2010, 2010),
  employ=c("yes", "yes", "yes", "yes", "no", "no","yes", "yes","no", "no"))

> data2
   id year employ
1   1 2011    yes
2   1 2011    yes
3   2 2011    yes
4   2 2011    yes
5   3 2011     no
6   3 2011     no
7   4 2011    yes
8   4 2011    yes
9   5 2010     no
10  5 2010     no

Solution

  • A data.table option

    setDT(data)[, employ := last(na.omit(employ[order(year)])), id]
    

    gives

        id year employ
     1:  1 2010    yes
     2:  1 2011    yes
     3:  2 2010    yes
     4:  2 2011    yes
     5:  3 2010     no
     6:  3 2011     no
     7:  4 2010    yes
     8:  4 2011    yes
     9:  5 2010     no
    10:  5 2011     no
    

    A dplyr way might be

    data %>%
      group_by(id) %>%
      mutate(employ = last(na.omit(employ[order(year)])))
    

    which gives

          id  year employ
       <dbl> <dbl> <chr>
     1     1  2010 yes
     2     1  2011 yes
     3     2  2010 yes
     4     2  2011 yes
     5     3  2010 no
     6     3  2011 no
     7     4  2010 yes
     8     4  2011 yes
     9     5  2010 no
    10     5  2011 no