Search code examples
rdataframejoindplyrtidyverse

How to join dataframes using a prefix as a match?


Suppose I have those data:

data <- tibble(
  x = c("ANOTHER", "COMMON", "ZEBRA")
)

And I want to join this dataframe with this one:

EDIT: The prefix has no fixed length

selection <- tibble(
  x_prefix = c("A", "B", "CO"),
  type = c("One type", "Other type", "Other type")
)

I want to make an INNER JOIN with the x and x_prefix columns, with the idea that a line is kept if x_prefix is a prefix of x.

The expected answer is:

answer <- tibble(
  x = c("ANOTHER", "COMMON"),
  type = c("One type", "Other type")
)

How one can do that with dplyr?


Solution

  • You can do a fuuzy_inner_join with str_detect as the matching function. The symbol "^" is from regex, meaning start of a string, therefore we need to paste it to your pattern to only match strings starting with x_prefix.

    library(fuzzyjoin)
    library(tidyverse)
    
    fuzzy_inner_join(data, selection, by = c("x" = "x_prefix"), 
                     match_fun = \(x, y) str_detect(x, paste0("^", y))) |> 
      select(-x_prefix)
    
    # A tibble: 2 x 2
      x       type      
      <chr>   <chr>     
    1 ANOTHER One type  
    2 COMMON  Other type