I'm looking to join two dataframes based on a condition, in this case, that one string is inside another. Say I have two dataframes,
df1 <- data.frame(fullnames=c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
ages = c(30, 51, 45, 38, 20))
fullnames ages
1 Jane Doe 30
2 Mr. John Smith 51
3 Nate Cox, Esq. 45
4 Bill Lee III 38
5 Ms. Kate Smith 20
df2 <- data.frame(lastnames=c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
ages=c(30, 45, 20, 28, 51, 38),
homestate=c("NJ", "CT", "MA", "RI", "MA", "NY"))
lastnames ages homestate
1 Doe 30 NJ
2 Cox 45 CT
3 Smith 20 MA
4 Jung 28 RI
5 Smith 51 MA
6 Lee 38 NY
I want to do a left join on these two dataframes on ages and the row in which df2$lastnames
is contained within df1$fullnames
. I thought fuzzy_join
might do it, but I don't think it liked my grepl
:
joined_dfs <- fuzzy_join(df1, df2, by = c("ages", "fullnames"="lastnames"),
+ match_fun = c("=", "grepl()"),
+ mode="left")
Error in which(m) : argument to 'which' is not logical
Desired result: a dataframe identical to the first but with a "homestate" column appended. Any ideas?
You just need to fix match_fun
:
# ...
match_fun = list(`==`, stringr::str_detect),
# ...
You had the right idea, but you went wrong in your interpretation of the match_fun
parameter in fuzzyjoin::fuzzy_join()
. Per the documentation, match_fun
should be a
Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in
by
(if a named list, it uses the names in x). If only one function is given it is used on all column pairs.
A simple correction will do the trick, with further formatting by dplyr
. For conceptual clarity, I've typographically aligned the by
columns with the function
s used to match them:
library(dplyr)
# ...
# Existing code
# ...
joined_dfs <- fuzzy_join(
df1, df2,
by = c("ages", "fullnames" = "lastnames"),
# |----| |-----------------------|
match_fun = list(`==` , stringr::str_detect ),
# |--| |-----------------|
# Match by equality ^ ^ Match by detection of `lastnames` in `fullnames`
mode = "left"
) %>%
# Format resulting dataset as you requested.
select(fullnames, ages = ages.x, homestate)
Given your sample data reproduced here
df1 <- data.frame(
fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
ages = c(30, 51, 45, 38, 20)
)
df2 <- data.frame(
lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
ages = c(30, 45, 20, 28, 51, 38),
homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)
this solution should produce the following data.frame
for joined_dfs
, formatted as requested:
fullnames ages homestate
1 Jane Doe 30 NJ
2 Mr. John Smith 51 MA
3 Nate Cox, Esq. 45 CT
4 Bill Lee III 38 NY
5 Ms. Kate Smith 20 MA
Because each ages
is coincidentally a unique key, the following join on only *names
fuzzy_join(
df1, df2,
by = c("fullnames" = "lastnames"),
match_fun = stringr::str_detect,
mode = "left"
)
will better illustrate the behavior of matching on substrings:
fullnames ages.x lastnames ages.y homestate
1 Jane Doe 30 Doe 30 NJ
2 Mr. John Smith 51 Smith 20 MA
3 Mr. John Smith 51 Smith 51 MA
4 Nate Cox, Esq. 45 Cox 45 CT
5 Bill Lee III 38 Lee 38 NY
6 Ms. Kate Smith 20 Smith 20 MA
7 Ms. Kate Smith 20 Smith 51 MA
The value passed to match_fun
should be either (the symbol
for) a function
fuzzyjoin::fuzzy_join(
# ...
match_fun = grepl
# ...
)
or a list
of such (symbol
s for) function
s:
fuzzyjoin::fuzzy_join(
# ...
match_fun = list(`=`, grepl)
# ...
)
Instead of providing a list
of symbol
s
match_fun = list(=, grepl)
you incorrectly provided a vector
of character
strings:
match_fun = c("=", "grepl()")
The user should name the function
s
`=`
grepl
yet you incorrectly attempted to call them:
=
grepl()
Naming them will pass the function
s themselves to match_fun
, as intended, whereas calling them will pass their return values*. In R, an operator like =
is named using backticks: `=`
.
* Assuming the calls didn't fail with errors. Here, they would fail.
To compare two values for equality, here the character
vectors df1$fullnames
and df2$lastnames
, you should use the relational operator ==
; yet you incorrectly supplied the assignment operator =
.
Furthermore grepl()
is not vectorized in quite the way match_fun
desires. While its second argument (x
) is indeed a vector
a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.
its first argument (pattern
) is (treated as) a single character
string:
character string containing a regular expression (or character string for
fixed = TRUE
) to be matched in the given character vector. Coerced byas.character
to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except forregexpr
,gregexpr
andregexec
.
Thus, grepl()
is not a
Vectorized function given two columns...
but rather a function
given one string (scalar) and one column (vector) of strings.
The answer to your prayers is not grepl()
but rather something like stringr::str_detect()
, which is
Vectorised over
string
andpattern
. Equivalent togrepl(pattern, x)
.
and which wraps stringi::stri_detect()
.
Since you're simply trying to detect whether a literal string in df1$fullnames
contains a literal string in df2$lastnames
, you don't want to accidentally treat the strings in df2$lastnames
as regular expression patterns. Now your df2$lastnames
column is statistically unlikely to contain names with special regex characters; with the lone exception of -
, which is interpreted literally outside of []
, which are very unlikely to be found in a name.
If you're still worried about accidental regex, you might want to consider alternative search methods with stringi::stri_detect_fixed()
or stringi::stri_detect_coll()
. These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.