Joining two dataframes on a condition (grepl)

I'm looking to join two dataframes based on a condition, in this case, that one string is inside another. Say I have two dataframes,

df1 <- data.frame(fullnames=c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"), 
                  ages = c(30, 51, 45, 38, 20))

       fullnames ages
1       Jane Doe   30
2 Mr. John Smith   51
3 Nate Cox, Esq.   45
4   Bill Lee III   38
5 Ms. Kate Smith   20

df2 <- data.frame(lastnames=c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"), 
                  ages=c(30, 45, 20, 28, 51, 38), 
                  homestate=c("NJ", "CT", "MA", "RI", "MA", "NY"))
  lastnames ages homestate
1       Doe   30        NJ
2       Cox   45        CT
3     Smith   20        MA
4      Jung   28        RI
5     Smith   51        MA
6       Lee   38        NY

I want to do a left join on these two dataframes on ages and the row in which df2$lastnames is contained within df1$fullnames. I thought fuzzy_join might do it, but I don't think it liked my grepl:

joined_dfs <- fuzzy_join(df1, df2, by = c("ages", "fullnames"="lastnames"), 
+                          match_fun = c("=", "grepl()"),
+                          mode="left")
Error in which(m) : argument to 'which' is not logical

Desired result: a dataframe identical to the first but with a "homestate" column appended. Any ideas?

Solution

TLDR

You just need to fix match_fun:

# ...
match_fun = list(`==`, stringr::str_detect),
# ...

Background

You had the right idea, but you went wrong in your interpretation of the match_fun parameter in fuzzyjoin::fuzzy_join(). Per the documentation, match_fun should be a

Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.

Solution

A simple correction will do the trick, with further formatting by dplyr. For conceptual clarity, I've typographically aligned the by columns with the functions used to match them:

library(dplyr)

# ...
# Existing code
# ...

joined_dfs <- fuzzy_join(
  df1, df2,

  by        =       c("ages", "fullnames" = "lastnames"),
  #                   |----|  |-----------------------|
  match_fun =    list(`==`  , stringr::str_detect      ),
  #                   |--|    |-----------------|
  #   Match by equality ^      ^ Match by detection of `lastnames` in `fullnames`    

  mode = "left"
) %>%
  # Format resulting dataset as you requested.
  select(fullnames, ages = ages.x, homestate)

Result

Given your sample data reproduced here

df1 <- data.frame(
  fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
  ages = c(30, 51, 45, 38, 20)
)

df2 <- data.frame(
  lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
  ages = c(30, 45, 20, 28, 51, 38),
  homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)

this solution should produce the following data.frame for joined_dfs, formatted as requested:

        fullnames ages homestate
1       Jane Doe   30        NJ
2 Mr. John Smith   51        MA
3 Nate Cox, Esq.   45        CT
4   Bill Lee III   38        NY
5 Ms. Kate Smith   20        MA

Note

Because each ages is coincidentally a unique key, the following join on only *names

fuzzy_join(
  df1, df2,
  by = c("fullnames" = "lastnames"),
  match_fun = stringr::str_detect,
  mode = "left"
)

will better illustrate the behavior of matching on substrings:

       fullnames ages.x lastnames ages.y homestate
1       Jane Doe     30       Doe     30        NJ
2 Mr. John Smith     51     Smith     20        MA
3 Mr. John Smith     51     Smith     51        MA
4 Nate Cox, Esq.     45       Cox     45        CT
5   Bill Lee III     38       Lee     38        NY
6 Ms. Kate Smith     20     Smith     20        MA
7 Ms. Kate Smith     20     Smith     51        MA

Where You Went Wrong

Error in Type

The value passed to match_fun should be either (the symbol for) a function

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = grepl
  # ...
)

or a list of such (symbols for) functions:

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = list(`=`, grepl)
  # ...
)

Instead of providing a list of symbols

match_fun = list(=, grepl)

you incorrectly provided a vector of character strings:

match_fun = c("=", "grepl()")

Error in Syntax

The user should name the functions

`=`
grepl

yet you incorrectly attempted to call them:

=
grepl()

Naming them will pass the functions themselves to match_fun, as intended, whereas calling them will pass their return values*. In R, an operator like = is named using backticks: `=`.

* Assuming the calls didn't fail with errors. Here, they would fail.

Inappropriate Functions

To compare two values for equality, here the character vectors df1$fullnames and df2$lastnames, you should use the relational operator ==; yet you incorrectly supplied the assignment operator =.

Furthermore grepl() is not vectorized in quite the way match_fun desires. While its second argument (x) is indeed a vector

a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.

its first argument (pattern) is (treated as) a single character string:

character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec.

Thus, grepl() is not a

Vectorized function given two columns...

but rather a function given one string (scalar) and one column (vector) of strings.

The answer to your prayers is not grepl() but rather something like stringr::str_detect(), which is

Vectorised over string and pattern. Equivalent to grepl(pattern, x).

and which wraps stringi::stri_detect().

Note

Since you're simply trying to detect whether a literal string in df1$fullnames contains a literal string in df2$lastnames, you don't want to accidentally treat the strings in df2$lastnames as regular expression patterns. Now your df2$lastnames column is statistically unlikely to contain names with special regex characters; with the lone exception of -, which is interpreted literally outside of [], which are very unlikely to be found in a name.

If you're still worried about accidental regex, you might want to consider alternative search methods with stringi::stri_detect_fixed() or stringi::stri_detect_coll(). These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.