Search code examples
rdateinner-joindplyr

Merge data sets based on id and date-R


I am trying add information from a second data set to my first based on ID and dates. If the ID matches and 'Date' is between 'start' and 'end', I want to add the value for colour to df1.

    df1
    ID Date 
    1  3/31/2017
    2  2/11/2016
    2  4/10/2016 
    3  5/15/2015

   df2
   ID  start      end        colour
    1   1/1/2000 3/31/2011    blue
    1   4/1/2011  6/4/2012    purple
    1   6/5/2012  3/31/2017   blue
    2   5/1/2014  3/31/2017   red
    3   1/12/2012  2/12/2014  purple

To get a result like this:

    dat
    ID Date        colour
    1  3/31/2017   blue
    2  2/11/2016   red
    2  4/10/2016   red
    3  5/15/2015   NA 

Which can be created with the code here:

library(lubridate)
df1 <- tibble(ID = c(1,2,2,3), Date = mdy(c("3/31/2017","2/11/2016","4/10/2016","5/15/2015")))
df2 <- tibble(ID = c(1,1,1,2,3), start = mdy(c("1/1/2000","4/1/2011","6/5/2012","5/1/2014","1/12/2012")), end = mdy(c("3/31/2011","6/4/2012","3/31/2017","3/31/2017","2/12/2014")), colour = c("blue", "purple", "blue", "red", "purple"))

I used a response from a similar question, Checking if Date is Between two Dates in R and used the code below:

    library(dplyr)
    dat <- inner_join(df1, df2, by = "ID")
    dat %>% rowwise() %>%
    mutate(match = ifelse(between(df1$Date, df2$start, df2$end), 1 , 0))%>%
    select(-c(df2$start, df2$end))%>%
    arrange(df1$Date, desc(match))%>%
    distinct(df1$Date)

and I get the following error:

Error in between(df1$Date, df2$start, df2$end) : Expecting a single value: [extent=355368].

help?

Thanks so much!

Update-

Thanks so much everyone for your answers.

I tried them all but all the final datasets have different number of rows than the first dataset. I am not sure what is happening. The data I have posted is made-up to resemble the data I am working with. Are there additional details that I should let you know? I don't know where to start...


Solution

  • dplyr uses non standard evaluation and so you can dump all the dataframe names and $s and your code begins basically in the right direction. There are additionally a number of implicit transformations necessary for you to end up with the data frame you specified, but this below will get you there.

    dat <- 
        df1 %>% 
        inner_join(df2) %>%
        rowwise %>% 
        mutate(match = ifelse(between(Date, start, end), 1 , NA)) %>%
        arrange(ID, Date, desc(match)) %>%
        ungroup %>% 
        group_by(ID, Date) %>% 
        mutate(best = row_number(ID), 
               colour = if_else(is.na(match), NA_character_, colour)) %>%
        filter(best == 1) %>% 
        select(ID, Date, colour) 
    
    > dat
        # A tibble: 4 x 3
        # Groups:   ID, Date [4]
             ID       Date colour
          <dbl>     <date>  <chr>
        1     1 2017-03-31   blue
        2     2 2016-02-11    red
        3     2 2016-04-10    red
        4     3 2015-05-15   <NA>