Search code examples
rdplyrleft-join

How to perform a left_join on multiple columns with at least one match or both?


I have two dataframes df_1 and df_2 (see dput() below) that I want to join using a left_join, based on multiple columns. However, I want to perform a successfull join if AT LEAST one of the columns matches, if not more.

I am currently using the code

join <- left_join(df_1, df_2, by = join_by(Txt_y == Txt_x, Head == Tail), keep = TRUE, na_matches = "never", multiple = "warning", unmatched = "drop")

How would I need to adapt join_by() to get a match if AT LEAST ONE column has a match?

Here is the dput() for both dataframes:

structure(list(Name_y = c("etwbv", "werg", "sdfg", "qwreg", "gvr", 
"wref"), URL_y = c("ewrg", "werg", "asd", "qwe", "gvre", "vrw"
), Txt_y = c("abc", "bfh", "fse", "rege", "wer", "vwr"), Head = c("abc1", 
"bfh", "fse", "rege1", "wer", "vwr")), class = "data.frame", row.names = c(NA, 
-6L))

structure(list(Name_x = c("etwbv", "werg", "sdfg", "qwreg", "gvr", 
"wref"), URL_x = c("ewrg", "werg", "asd", "qwe", "gvre", "vrw"
), Txt_x = c("abc", "rfwcq", "fse", "wefc", "aefc", "vwr"), Tail = c("abc1", 
"bfh", "fse", "wreg", "ecdw", "vwr")), class = "data.frame", row.names = c(NA, 
-6L))

And this is what I would like as an output:

Name_y  URL_y   Txt_y  Head  Name_x URL_x  Txt_x  Tail 
etwbv   ewrg    abc    abc1  etwbv  ewrg   abc    abc1
werg    werg    bfh    bfh   werg   werg   rfwcq  bfh
sdfg    asd     fse    fse   sdfg   asd    fse    fse
qwreg   qwe     rege   rege1 NA     NA     NA     NA
gvr     gvre    wer    wer   NA     NA     NA     NA
wref    vrw     vwr    vwr   wref   vrw    vwr    vwr

With the code above, I currently get NAs for the second row for df_2, but since there is at least one match between columns Head and Tail I would like to get that match into the new dataframe join. How can I adapt my code to accomplish this?


Solution

  • Using {powerjoin} :

    library(powerjoin)
    power_left_join(df_1, df_2, by = ~ .x$Txt_y == .y$Txt_x | .x$Head == .y$Tail)
    #>   Name_y URL_y Txt_y  Head Name_x URL_x Txt_x Tail
    #> 1  etwbv  ewrg   abc  abc1  etwbv  ewrg   abc abc1
    #> 2   werg  werg   bfh   bfh   werg  werg rfwcq  bfh
    #> 3   sdfg   asd   fse   fse   sdfg   asd   fse  fse
    #> 4  qwreg   qwe  rege rege1   <NA>  <NA>  <NA> <NA>
    #> 5    gvr  gvre   wer   wer   <NA>  <NA>  <NA> <NA>
    #> 6   wref   vrw   vwr   vwr   wref   vrw   vwr  vwr
    

    Created on 2023-03-16 with reprex v2.0.2

    If you have some NAs in some columns used in by this will fail with the CRAN version.

    df_1$Txt_y[1] <- NA
    
    power_left_join(df_1, df_2, by = ~ .x$Txt_y == .y$Txt_x | .x$Head == .y$Tail)
    #> Error in if (sum(m) == 0) {: missing value where TRUE/FALSE needed
    

    You'll need the GitHub version (devtools::install_github("moodymudskipper/powerjoin")) or the following workaround :

    power_left_join(df_1, df_2, by = ~ dplyr::if_else(.x$Txt_y == .y$Txt_x | .x$Head == .y$Tail, TRUE, FALSE, FALSE))
    #>   Name_y URL_y Txt_y  Head Name_x URL_x Txt_x Tail
    #> 1  etwbv  ewrg  <NA>  abc1  etwbv  ewrg   abc abc1
    #> 2   werg  werg   bfh   bfh   werg  werg rfwcq  bfh
    #> 3   sdfg   asd   fse   fse   sdfg   asd   fse  fse
    #> 4  qwreg   qwe  rege rege1   <NA>  <NA>  <NA> <NA>
    #> 5    gvr  gvre   wer   wer   <NA>  <NA>  <NA> <NA>
    #> 6   wref   vrw   vwr   vwr   wref   vrw   vwr  vwr
    

    I believe it makes sense to dismiss the NAs by default though. So this behaviour might be enhanced in a further version (I'm the author). For sure the error message is not satisfying.