Search code examples
rdatejoinnon-equi-join

Replace NA with values from another data frame, nearest in time, by group


I have two dataframes. Both have variables "ID" (a grouping variable), "Timepoint" (class Date), and a value variable "Values". Some example data:

df1 <- data.frame(
  ID = c(1,1,1,1,2,2,3,3),
  Timepoint = as.Date(c("2019-05-21", "2019-05-23", "2019-05-26", "2019-05-27", "2018-07-03", "2018-08-03", "2018-06-04", "2018-06-10")),
  Values = c(NA, 23, NA, NA, NA, 35, 23, 42)
  )

  ID  Timepoint Values
1  1 2019-05-21     NA
2  1 2019-05-23     23
3  1 2019-05-26     NA
4  1 2019-05-27     NA
5  2 2018-07-03     NA
6  2 2018-08-03     35
7  3 2018-06-04     23
8  3 2018-06-10     42

df2 <- data.frame(
  ID = c(1,1,2,2,3),
  Timepoint = as.Date(c("2019-05-20", "2019-05-24", "2018-09-03", "2018-06-04", "2018-06-10")),
  Values = c(11, 30, 35, 23, 42))

  ID  Timepoint Values
1  1 2019-05-20     11
2  1 2019-05-24     30
3  2 2018-09-03     35
4  2 2018-06-04     23
5  3 2018-06-10     42

In "df1", the value variable has some missing values NA. I want to replace the NA "Values" in "df1" with values from "df2", which has the same "ID" and closest date ("Timepoint").

My desired result:

  ID  Timepoint Values
1  1 2019-05-21     11
2  1 2019-05-23     23
3  1 2019-05-26     30
4  1 2019-05-27     30
5  2 2018-07-03     23
6  2 2018-08-03     35
7  3 2018-06-04     23
8  3 2018-06-10     42

Note that for ID 1, the 30 has been used twice to fill in the missing value as both were closest to 2019-05-24.

I tried joining using roll = "nearest" from data.table. But from what I understand, each value from df2 was only being used once.

I'd preferably use lubridate, tidyverse. But open to other packages if needed.


Solution

  • Here's a solution using data.table

    library(data.table)
    
    df1 <- data.table(
      ID = c(1,1,1,1,2,2,3,3),
      Timepoint = as.Date(c("2019-05-21", "2019-05-23", "2019-05-26", "2019-05-27", "2018-07-03", "2018-08-03", "2018-06-04", "2018-06-10")),
      Values = c(NA, 23, NA, NA, NA, 35, 23, 42)
    )
    df2 <- data.table(
      ID = c(1,1,2,2,3),
      Timepoint = as.Date(c("2019-05-20", "2019-05-24", "2018-09-03", "2018-06-04", "2018-06-10")),
      Values = c(11, 30, 35, 23, 42))
    
    na_rows <- which(is.na(df1$Values))
    df1[na_rows, Values := df2[ df1[na_rows,], x.Values, on = .(ID, Timepoint), roll = Inf ] ]
    print(df1)
    

    Hope this helps!

    Scratch that, missed the fact that your grouping by ID as well. In that case, add ID to the on argument, edited to be correct in the code above.