Search code examples
rdplyrleft-joinr-faq

dplyr left_join by less than, greater than condition


This question is somewhat related to issues Efficiently merging two data frames on a non-trivial criteria and Checking if date is between two dates in r. And the one I have posted here requesting if the feature exist: GitHub issue

I am looking to join two dataframes using dplyr::left_join(). The condition I use to join is less-than, greater-than i.e, <= and >. Does dplyr::left_join() support this feature? or do the keys only take = operator between them. This is straightforward to run from SQL (assuming I have the dataframe in the database)

Here is a MWE: I have two datasets one firm-year (fdata), while second is sort of survey data that happens once every five years. So for all years in the fdata that are in between two survey years, I join the corresponding survey year data.

id <- c(1,1,1,1,
        2,2,2,2,2,2,
        3,3,3,3,3,3,
        5,5,5,5,
        8,8,8,8,
        13,13,13)

fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,
       1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,
       1998,1999,2000,2001,1998,1999,2000)

byear <- c(1990,1995,2000,2005)
eyear <- c(1995,2000,2005,2010)
val <- c(3,1,5,6)

sdata <- tbl_df(data.frame(byear, eyear, val))

fdata <- tbl_df(data.frame(id, fyear))

test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear"))

I get

Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds 

Unless if left_join can handle the condition, but my syntax is missing something?


Solution

  • The original answer below is out of date, as pointed out in another answer. With newer versions of dplyr, simply use the following. (Note that this syntax works even with database backends using dbplyr.)

    fdata %>% 
    left_join(sdata,
              join_by(fyear >= byear, fyear < eyear))
    

    When the original answer was created, there was no easy way to do inequality joins using dplyr.

    Original answer

    Use a filter. (But note that this answer does not produce a correct LEFT JOIN; but the MWE gives the right result with an INNER JOIN instead.)

    The dplyr package isn't happy if asked merge two tables without something to merge on, so in the following, I make a dummy variable in both tables for this purpose, then filter, then drop dummy:

    fdata %>% 
        mutate(dummy=TRUE) %>%
        left_join(sdata %>% mutate(dummy=TRUE)) %>%
        filter(fyear >= byear, fyear < eyear) %>%
        select(-dummy)
    

    And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy variable as evidenced by the following two query explanations:

    > fdata %>% 
    +     mutate(dummy=TRUE) %>%
    +     left_join(sdata %>% mutate(dummy=TRUE)) %>%
    +     filter(fyear >= byear, fyear < eyear) %>%
    +     select(-dummy) %>%
    +     explain()
    Joining by: "dummy"
    <SQL>
    SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
    FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
    FROM "fdata") AS "zzz136"
    
    LEFT JOIN 
    
    (SELECT "byear", "eyear", "val", TRUE AS "dummy"
    FROM "sdata") AS "zzz137"
    
    USING ("dummy")) AS "zzz138"
    WHERE "fyear" >= "byear" AND "fyear" < "eyear"
    
    
    <PLAN>
    Nested Loop  (cost=0.00..50886.88 rows=322722 width=40)
      Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
      ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
      ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
            ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)
    

    and doing it more cleanly with SQL gives exactly the same result:

    > tbl(pg, sql("
    +     SELECT *
    +     FROM fdata 
    +     LEFT JOIN sdata 
    +     ON fyear >= byear AND fyear < eyear")) %>%
    +     explain()
    <SQL>
    SELECT "id", "fyear", "byear", "eyear", "val"
    FROM (
        SELECT *
        FROM fdata 
        LEFT JOIN sdata 
        ON fyear >= byear AND fyear < eyear) AS "zzz140"
    
    
    <PLAN>
    Nested Loop Left Join  (cost=0.00..50886.88 rows=322722 width=40)
      Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
      ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
      ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
            ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)