Assume I have df1:
Start_Date End_Date Value
2001-01-01 2001-12-31 1
2002-01-01 2002-12-31 2
2003-01-01 2003-12-31 3
2004-01-01 2004-12-31 4
2005-01-01 2005-12-31 5
& df2:
DateTime Gain People
2003-01-01 3 3
2003-05-09 5 4
2004-12-31 1 2
2005-01-31 -2 2
2005-08-13 9 7
2006-09-10 6 8
2007-10-03 7 5
What I would like to do is to import Value from df1 into df2 by checking which DateTime in df2 falls in between Start_Date & End_Date in df1. If dates do not fall into any of periods in df1, return Value 0. A view of desired outcome:
DateTime Gain People Value
2003-01-01 3 3 3
2003-05-09 5 4 3
2004-12-31 1 2 4
2005-01-31 -2 2 5
2005-08-13 9 7 5
2006-09-10 6 8 0
2007-10-03 7 5 0
Please advise
The underlying challenge is joining on inequality conditions. These are straightforward in sql, but not in R as dplyr only joins on equality. Here is the a link to the primary answer on S.O.
The best way I have found is to do a more general join and then filter for the inequalities.
df1 = df1 %>% mutate(join_id = 1)
df2 = df2 %>% mutate(join_id = 1)
output = left_join(df2, df1, by = 'join_id') %>%
filter(Start_Date <= DateTime,
DateTime <= End_Date) %>%
select(DateTime, Gain, People, Value) %>%
mutate(Value = ifelse(is.na(Value), 0, Value)
Explanation:
ifelse(is.na(...
is used to replace the missing values.In some cases this much more general join can create performance issues. But as R uses lazy-evaluation by default, if you filter immediately after the join then R should run both statements as part of the same join and avoid any performance problem.