I have two dataframes df_1
and df_2
(see dput()
below) that I want to join using a left_join, based on multiple columns. However, I want to perform a successfull join if AT LEAST one of the columns matches, if not more.
I am currently using the code
join <- left_join(df_1, df_2, by = join_by(Txt_y == Txt_x, Head == Tail), keep = TRUE, na_matches = "never", multiple = "warning", unmatched = "drop")
How would I need to adapt join_by()
to get a match if AT LEAST ONE column has a match?
Here is the dput()
for both dataframes:
structure(list(Name_y = c("etwbv", "werg", "sdfg", "qwreg", "gvr",
"wref"), URL_y = c("ewrg", "werg", "asd", "qwe", "gvre", "vrw"
), Txt_y = c("abc", "bfh", "fse", "rege", "wer", "vwr"), Head = c("abc1",
"bfh", "fse", "rege1", "wer", "vwr")), class = "data.frame", row.names = c(NA,
-6L))
structure(list(Name_x = c("etwbv", "werg", "sdfg", "qwreg", "gvr",
"wref"), URL_x = c("ewrg", "werg", "asd", "qwe", "gvre", "vrw"
), Txt_x = c("abc", "rfwcq", "fse", "wefc", "aefc", "vwr"), Tail = c("abc1",
"bfh", "fse", "wreg", "ecdw", "vwr")), class = "data.frame", row.names = c(NA,
-6L))
And this is what I would like as an output:
Name_y URL_y Txt_y Head Name_x URL_x Txt_x Tail
etwbv ewrg abc abc1 etwbv ewrg abc abc1
werg werg bfh bfh werg werg rfwcq bfh
sdfg asd fse fse sdfg asd fse fse
qwreg qwe rege rege1 NA NA NA NA
gvr gvre wer wer NA NA NA NA
wref vrw vwr vwr wref vrw vwr vwr
With the code above, I currently get NAs for the second row for df_2
, but since there is at least one match between columns Head
and Tail
I would like to get that match into the new dataframe join
. How can I adapt my code to accomplish this?
Using {powerjoin} :
library(powerjoin)
power_left_join(df_1, df_2, by = ~ .x$Txt_y == .y$Txt_x | .x$Head == .y$Tail)
#> Name_y URL_y Txt_y Head Name_x URL_x Txt_x Tail
#> 1 etwbv ewrg abc abc1 etwbv ewrg abc abc1
#> 2 werg werg bfh bfh werg werg rfwcq bfh
#> 3 sdfg asd fse fse sdfg asd fse fse
#> 4 qwreg qwe rege rege1 <NA> <NA> <NA> <NA>
#> 5 gvr gvre wer wer <NA> <NA> <NA> <NA>
#> 6 wref vrw vwr vwr wref vrw vwr vwr
Created on 2023-03-16 with reprex v2.0.2
If you have some NAs in some columns used in by
this will fail with the CRAN version.
df_1$Txt_y[1] <- NA
power_left_join(df_1, df_2, by = ~ .x$Txt_y == .y$Txt_x | .x$Head == .y$Tail)
#> Error in if (sum(m) == 0) {: missing value where TRUE/FALSE needed
You'll need the GitHub version (devtools::install_github("moodymudskipper/powerjoin")
) or the following workaround :
power_left_join(df_1, df_2, by = ~ dplyr::if_else(.x$Txt_y == .y$Txt_x | .x$Head == .y$Tail, TRUE, FALSE, FALSE))
#> Name_y URL_y Txt_y Head Name_x URL_x Txt_x Tail
#> 1 etwbv ewrg <NA> abc1 etwbv ewrg abc abc1
#> 2 werg werg bfh bfh werg werg rfwcq bfh
#> 3 sdfg asd fse fse sdfg asd fse fse
#> 4 qwreg qwe rege rege1 <NA> <NA> <NA> <NA>
#> 5 gvr gvre wer wer <NA> <NA> <NA> <NA>
#> 6 wref vrw vwr vwr wref vrw vwr vwr
I believe it makes sense to dismiss the NAs by default though. So this behaviour might be enhanced in a further version (I'm the author). For sure the error message is not satisfying.