Search code examples
rleft-joingreplfuzzyjoin

Joining two dataframes on a condition (grepl)


I'm looking to join two dataframes based on a condition, in this case, that one string is inside another. Say I have two dataframes,

df1 <- data.frame(fullnames=c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"), 
                  ages = c(30, 51, 45, 38, 20))

       fullnames ages
1       Jane Doe   30
2 Mr. John Smith   51
3 Nate Cox, Esq.   45
4   Bill Lee III   38
5 Ms. Kate Smith   20

df2 <- data.frame(lastnames=c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"), 
                  ages=c(30, 45, 20, 28, 51, 38), 
                  homestate=c("NJ", "CT", "MA", "RI", "MA", "NY"))
  lastnames ages homestate
1       Doe   30        NJ
2       Cox   45        CT
3     Smith   20        MA
4      Jung   28        RI
5     Smith   51        MA
6       Lee   38        NY

I want to do a left join on these two dataframes on ages and the row in which df2$lastnames is contained within df1$fullnames. I thought fuzzy_join might do it, but I don't think it liked my grepl:

joined_dfs <- fuzzy_join(df1, df2, by = c("ages", "fullnames"="lastnames"), 
+                          match_fun = c("=", "grepl()"),
+                          mode="left")
Error in which(m) : argument to 'which' is not logical

Desired result: a dataframe identical to the first but with a "homestate" column appended. Any ideas?


Solution

  • TLDR

    You just need to fix match_fun:

    # ...
    match_fun = list(`==`, stringr::str_detect),
    # ...
    

    Background

    You had the right idea, but you went wrong in your interpretation of the match_fun parameter in fuzzyjoin::fuzzy_join(). Per the documentation, match_fun should be a

    Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.

    Solution

    A simple correction will do the trick, with further formatting by dplyr. For conceptual clarity, I've typographically aligned the by columns with the functions used to match them:

    library(dplyr)
    
    # ...
    # Existing code
    # ...
    
    joined_dfs <- fuzzy_join(
      df1, df2,
    
      by        =       c("ages", "fullnames" = "lastnames"),
      #                   |----|  |-----------------------|
      match_fun =    list(`==`  , stringr::str_detect      ),
      #                   |--|    |-----------------|
      #   Match by equality ^      ^ Match by detection of `lastnames` in `fullnames`    
    
      mode = "left"
    ) %>%
      # Format resulting dataset as you requested.
      select(fullnames, ages = ages.x, homestate)
    

    Result

    Given your sample data reproduced here

    df1 <- data.frame(
      fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
      ages = c(30, 51, 45, 38, 20)
    )
    
    df2 <- data.frame(
      lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
      ages = c(30, 45, 20, 28, 51, 38),
      homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
    )
    

    this solution should produce the following data.frame for joined_dfs, formatted as requested:

            fullnames ages homestate
    1       Jane Doe   30        NJ
    2 Mr. John Smith   51        MA
    3 Nate Cox, Esq.   45        CT
    4   Bill Lee III   38        NY
    5 Ms. Kate Smith   20        MA
    

    Note

    Because each ages is coincidentally a unique key, the following join on only *names

    fuzzy_join(
      df1, df2,
      by = c("fullnames" = "lastnames"),
      match_fun = stringr::str_detect,
      mode = "left"
    )
    

    will better illustrate the behavior of matching on substrings:

           fullnames ages.x lastnames ages.y homestate
    1       Jane Doe     30       Doe     30        NJ
    2 Mr. John Smith     51     Smith     20        MA
    3 Mr. John Smith     51     Smith     51        MA
    4 Nate Cox, Esq.     45       Cox     45        CT
    5   Bill Lee III     38       Lee     38        NY
    6 Ms. Kate Smith     20     Smith     20        MA
    7 Ms. Kate Smith     20     Smith     51        MA
    

    Where You Went Wrong

    Error in Type

    The value passed to match_fun should be either (the symbol for) a function

    fuzzyjoin::fuzzy_join(
      # ...
      match_fun = grepl
      # ...
    )
    

    or a list of such (symbols for) functions:

    fuzzyjoin::fuzzy_join(
      # ...
      match_fun = list(`=`, grepl)
      # ...
    )
    

    Instead of providing a list of symbols

    match_fun = list(=, grepl)
    

    you incorrectly provided a vector of character strings:

    match_fun = c("=", "grepl()")
    

    Error in Syntax

    The user should name the functions

    `=`
    grepl
    

    yet you incorrectly attempted to call them:

    =
    grepl()
    

    Naming them will pass the functions themselves to match_fun, as intended, whereas calling them will pass their return values*. In R, an operator like = is named using backticks: `=`.

    * Assuming the calls didn't fail with errors. Here, they would fail.

    Inappropriate Functions

    To compare two values for equality, here the character vectors df1$fullnames and df2$lastnames, you should use the relational operator ==; yet you incorrectly supplied the assignment operator =.

    Furthermore grepl() is not vectorized in quite the way match_fun desires. While its second argument (x) is indeed a vector

    a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.

    its first argument (pattern) is (treated as) a single character string:

    character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec.

    Thus, grepl() is not a

    Vectorized function given two columns...

    but rather a function given one string (scalar) and one column (vector) of strings.

    The answer to your prayers is not grepl() but rather something like stringr::str_detect(), which is

    Vectorised over string and pattern. Equivalent to grepl(pattern, x).

    and which wraps stringi::stri_detect().

    Note

    Since you're simply trying to detect whether a literal string in df1$fullnames contains a literal string in df2$lastnames, you don't want to accidentally treat the strings in df2$lastnames as regular expression patterns. Now your df2$lastnames column is statistically unlikely to contain names with special regex characters; with the lone exception of -, which is interpreted literally outside of [], which are very unlikely to be found in a name.

    If you're still worried about accidental regex, you might want to consider alternative search methods with stringi::stri_detect_fixed() or stringi::stri_detect_coll(). These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.