Search code examples
rdataframedatetimedplyrsubset

R: Import values from one dataframe to another by matching dates


Assume I have df1:

Start_Date    End_Date     Value
2001-01-01    2001-12-31   1
2002-01-01    2002-12-31   2
2003-01-01    2003-12-31   3
2004-01-01    2004-12-31   4
2005-01-01    2005-12-31   5 

& df2:

DateTime      Gain   People
2003-01-01    3      3
2003-05-09    5      4
2004-12-31    1      2
2005-01-31    -2     2
2005-08-13    9      7
2006-09-10    6      8
2007-10-03    7      5

What I would like to do is to import Value from df1 into df2 by checking which DateTime in df2 falls in between Start_Date & End_Date in df1. If dates do not fall into any of periods in df1, return Value 0. A view of desired outcome:

DateTime      Gain   People   Value
2003-01-01    3      3        3
2003-05-09    5      4        3
2004-12-31    1      2        4
2005-01-31    -2     2        5
2005-08-13    9      7        5
2006-09-10    6      8        0
2007-10-03    7      5        0

Please advise


Solution

  • The underlying challenge is joining on inequality conditions. These are straightforward in sql, but not in R as dplyr only joins on equality. Here is the a link to the primary answer on S.O.

    The best way I have found is to do a more general join and then filter for the inequalities.

    df1 = df1 %>% mutate(join_id = 1)
    df2 = df2 %>% mutate(join_id = 1)
    
    output = left_join(df2, df1, by = 'join_id') %>%
      filter(Start_Date <= DateTime,
             DateTime <= End_Date) %>%
      select(DateTime, Gain, People, Value) %>%
      mutate(Value = ifelse(is.na(Value), 0, Value)
    

    Explanation:

    • As your dataframes do not have an existing ID column we first create one. This is unnecessary if you want to join on some equality constraints and some inequality constraints.
    • We use a (much) more general join followed by a filter for the inequality constraints.
    • ifelse(is.na(... is used to replace the missing values.

    In some cases this much more general join can create performance issues. But as R uses lazy-evaluation by default, if you filter immediately after the join then R should run both statements as part of the same join and avoid any performance problem.