Search code examples
rstringleft-joinfuzzyjoin

Fuzzy Left Join exact + partial string match


I'm using a fuzzy_left_join function to match tables with exact + fuzzy matching.

One of the match_fun arguments that I'm using involves checking if part of a string is contained inside another string. When only using exact matching, it returns the values, but when adding the function below, the join only returns NA values.

Function to compare strings:

detect <- function(x, y){ 
  any(unlist(strsplit(x, split = "/")) %in% unlist(strsplit(y, split = "/")))
    }

The idea is that on the first table X, a column named "Names" has row values like "a/b/c" and on column "Names" from second table Y, values also like "a/d/e", hence the "a" part of the string is contained on the second value as well, and a TRUE should return (and thus allowing the joining).

When using a simple join, without the custom function, it works and finds some values:

x <- tribble(~X1, ~X2, ~Names,
        #--|--|----
        "5000", "a", "a/b/c",
        "6000", "b", "d/e/f",
        "7000", "c", "g/h/i")

y <- tribble(~Y1, ~Y2, ~Names,
        #--|--|----
        "5000", "a", "a/j/k",
        "6000", "b", "l/m/n",
        "8000", "d", "o/p/q")

fuzzyjoin::fuzzy_left_join(x, y, by = c("X1" = "Y1", "X2" = "Y2"),
                           match_fun = list(`==`, `==`))

# Produces:

#> A tibble: 3 x 6
#>   X1    X2    Names.x           Y1    Y2    Names.y      
#>  <chr> <chr> <chr>             <chr> <chr> <chr>        
#> 1 5000  a     a/b/c             5000  a     a/j/k
#> 2 6000  b     d/e/f             6000  b     l/m/n  
#> 3 7000  c     g/h/i             NA    NA    NA           

But when adding the custom function:

fuzzyjoin::fuzzy_left_join(x, y, by = c("X1" = "Y1", "X2" = "Y2", "Names" = "Names"),
                           match_fun = list(`==`, `==`, detect))

# Produces:

#> A tibble: 3 x 6
#> X1    X2    Names.x             Y1    Y2    Names.y
#> <chr> <chr> <chr>             <chr> <chr> <chr>  
#> 1 5000  a     a/b/c             NA    NA    NA     
#> 2 6000  b     d/e/f             NA    NA    NA     
#> 3 7000  c     g/h/i             NA    NA    NA     


# Intended:

#> A tibble: 3 x 6
#> X1    X2    Names.x             Y1    Y2    Names.y
#> <chr> <chr> <chr>              <chr> <chr> <chr>  
#> 1 5000  a     a/b/c             5000  a     a/j/k
#> 2 6000  b     d/e/f             NA    NA    NA     
#> 3 7000  c     g/h/i             NA    NA    NA   

Could you please give some thoughts?


Solution

  • The function applied in match_fun doesn't apply to one combination at a time. It applies the function to all combinations so you need to change the detect function :

    detect <- function(x, y){ 
      mapply(function(x, y) any(x == y), strsplit(x, '/'), strsplit(y, '/'))
    }
    

    and then try :

    fuzzyjoin::fuzzy_left_join(x, y, by = c("X1" = "Y1", "X2" = "Y2", "Names"),
                               match_fun = list(`==`, `==`, detect))
    
    #  X1    X2    Names.x Y1    Y2    Names.y
    #  <chr> <chr> <chr>   <chr> <chr> <chr>  
    #1 5000  a     a/b/c   5000  a     a/j/k  
    #2 6000  b     d/e/f   NA    NA    NA     
    #3 7000  c     g/h/i   NA    NA    NA