Search code examples
rdplyr

Identify value in one dataset whose date is closest to a date in another dataset, using dplyr in R


I have 2 mouse datasets where each mouse has an ID and dates on two tibbles (big_df and small_df), I would like identify the value of "value" variable for each mouse ID in the big_df whose date occurs before and is closest to the date in small_df dataframe ideally using tidyverse style of programming.

I was thinking in the lines of group big_df by ID and retain the rows where the date occurs before and is closest to the date in small_df for that mouse ID. The final output should be a filtered big_df tibble with ID, date (prior to and closest to date in small_df) and value.

Sample data below.

library(tidyverse)

# big_df 
big_df <- tibble(ID = rep(1:10, each = 5), date = rep(seq(as.Date("2022-01-01"), as.Date("2022-01-05"), by = "days"), 10), value = sample(15:25, 50, replace = TRUE))

# small_df
small_df <- tibble(ID = 1:8, date = c("2022-01-05", "2022-01-04", "2022-01-03", "2022-01-02", "2022-01-01", "2022-01-05", "2022-01-04", "2022-01-03")) %>% mutate(date = as.Date(date))


Solution

  • I would have joined and filtered like this :

    left_join(big_df, small_df, by = "ID") %>% 
      filter(date.x < date.y) %>% 
      slice_max(date.x, n = 1, by = ID)
    

    Note that the result does not output a line for ID 5 since all dates are at least the date in small_df.

    And I saw the comment of deschen and found the join_by function of dplyr, thanks a lot ! You can wrap it up in 1 line with that (provided that you change the name of date in one data.frame) :

    big_df <- tibble(ID = rep(1:10, each = 5), date = rep(seq(as.Date("2022-01-01"), as.Date("2022-01-05"), by = "days"), 10), value = sample(15:25, 50, replace = TRUE))
    small_df <- tibble(ID = 1:8, date_small = c("2022-01-05", "2022-01-04", "2022-01-03", "2022-01-02", "2022-01-01", "2022-01-05", "2022-01-04", "2022-01-03")) %>% mutate(date_small = as.Date(date_small))
    left_join(small_df, big_df, by = join_by(ID, closest(date_small > date)))
    

    I replaced the date in small_df by date_small. In the output, the ID 5 is present with NA in date and value.