
fuzzy_left_join with match_fun %in%


Some data

example_df <- data.frame(
  url = c('blog/blah', 'blog/?utm_medium=foo', 'blah', 'subscription/apples', 'UK/something'),
  numbs = 1:5
)

lookup_df <- data.frame(
  string = c('blog', 'subscription', 'UK'),
  group = c('blog', 'subs', 'UK')
)


library(dplyr)      # provides %>%
library(fuzzyjoin)
data_combined <- example_df %>% 
  fuzzy_left_join(lookup_df, by = c("url" = "string"), 
                  match_fun = `%in%`)

data_combined
                   url numbs string group
1            blog/blah     1   <NA>  <NA>
2 blog/?utm_medium=foo     2   <NA>  <NA>
3                 blah     3   <NA>  <NA>
4  subscription/apples     4   <NA>  <NA>
5         UK/something     5   <NA>  <NA>

I expected data_combined to have values for string and group wherever match_fun finds a match. Instead, both columns are all NA.

For example, the first value of string in lookup_df is 'blog'. Since this is contained in the first value of url in example_df ('blog/blah'), I expected a match, with 'blog' in both the string and group fields.


Solution

  • If the goal is a partial match between the part of 'url' before the first / and the 'string' column in 'lookup_df', we can extract that substring into a new column and then use regex_left_join

    library(dplyr)
    library(fuzzyjoin)
    library(stringr)
    example_df %>%
        mutate(string = str_remove(url, "\\/.*")) %>% 
        regex_left_join(lookup_df, by = 'string') %>%
        select(url, numbs, group)
    

    Output:

    #                   url numbs group
    #1            blog/blah     1  blog
    #2 blog/?utm_medium=foo     2  blog
    #3                 blah     3  <NA>
    #4  subscription/apples     4  subs
    #5         UK/something     5    UK
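
As for why the original attempt returned all NA: %in% tests whether a whole value is equal to one of the candidate values, not whether one string is contained in another, so 'blog/blah' is never %in% c('blog', 'subscription', 'UK'). The base R sketch below contrasts the two checks on the same data. The vector names (urls, strings, exact_hits, prefix_hits) are made up for illustration, and the prefix test with startsWith is just one way to express "starts with a lookup string", not what fuzzyjoin does internally.

```r
urls    <- c("blog/blah", "blog/?utm_medium=foo", "blah",
             "subscription/apples", "UK/something")
strings <- c("blog", "subscription", "UK")

# %in% asks: is each whole url equal to one of the lookup strings?
exact_hits <- urls %in% strings

# Prefix test: does each url start with any of the lookup strings?
prefix_hits <- vapply(
  urls,
  function(u) any(startsWith(u, strings)),
  logical(1),
  USE.NAMES = FALSE
)

print(exact_hits)   # no url equals a lookup string exactly
print(prefix_hits)  # every url except 'blah' starts with a lookup string
```

A match_fun passed to fuzzy_left_join needs to do the second kind of comparison to produce matches here, which is why switching to a regex-based join gives the expected groups.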