Search code examples
rsqldffuzzyjoin

Selective left join in r


I want to selectively left join two dataframes based on a joint column and the condition of rows.

I saw some similar posts using fuzzyjoin and sqldf, but the previous examples I found are not exactly like mine.

Example dfs:

df1 <- data.frame(id = c("1", "2", "3"),
              zipcode = c("11111", "44444", "33333"),
              exp.id = c("0", "0", "1"))
df2 <- data_frame(zipcode = c("11111", "22222", "33333", "44444", "55555"),
              pct = c("0.1", "0.5", "0.9", "0.7", "0.8"))

Basically, I want to join the "pct" column in df2 to df1 by zipcode but only join where "exp.id" = "0"

The outcome I expect should look like this:

  id    zipcode exp.id pct  
 <chr> <chr>   <chr>  <chr>
1 1     11111   0      0.1  
2 2     44444   0      0.7  
3 3     33333   1      NA  

Thank you in advance.


Solution

  • 1) This left joins df1 with df2 on zipcode but only joins rows for which exp.id is 0. For other rows pct is NA as in the expected result shown in the question. Note that dot is an SQL operator so we surround exp.id with square brackets to escape the name.

    library(sqldf)
    
    sqldf("select a.id, a.zipcode, b.pct
      from df1 a 
      left join df2 b on a.zipcode = b.zipcode and [exp.id] = 0")
    ##   id zipcode  pct
    ## 1  1   11111  0.1
    ## 2  2   44444  0.7
    ## 3  3   33333 <NA>
    

    2) This is like (1) but returns only the exp.id rows that are zero. This is different than what is asked for in the question but a comment suggested that it is of interest.

    The difference between the code here and (1) illustrate the subtle difference between including a condition in on and in where. Because we have a simple condition in this case we can use the using clause instead of on. using results in a single zipcode so we don't need to distinguish between a.zipcode and b.zipcode.

    sqldf("select a.id, zipcode, b.pct
      from df1 a left join df2 b using(zipcode)
      where [exp.id] = 0")
    ##   id zipcode pct
    ## 1  1   11111 0.1
    ## 2  2   44444 0.7
    

    Note that the SQL engine internally creates a query plan to optimize the calculation while maintaining the same output. It does not necessarily perform the operations in the order written, i.e. it does not necessarily perform the join and then reduce the result but may reduce df1 first to improve performance as that gives the same result. We display information on the query plan below and we see that, indeed, it scans df1 first.

    sqldf("explain query plan select a.id, zipcode, b.pct
          from df1 a left join df2 b using(zipcode)
          where [exp.id] = 0")
    ##   id parent notused                                                           detail
    ## 1  3      0       0                                              SCAN TABLE df1 AS a
    ## 2 16      0       0 SEARCH TABLE df2 AS b USING AUTOMATIC COVERING INDEX (zipcode=?)