Search code examples
rdataframefiltersubstringfuzzyjoin

How can I filter one dataframe based on two columns of another dataframe where one column is the exact match and the other one is a substring match?


I have two dataframes, both have a Last_Name column. First dataframe has a column Contains_First_Name and the second has a column called First_Name. I want to join the two on the exact spelling of Last_Name and a substring match of Contains_First_Name and First_Name (where First_Name is a substring of Contains_First_Name.) Please see an example below.

library(dplyr)
library(stringr)

# Create df1
Last_Name <- c("Smith", "Jones", "Adams", "Rogers", "Lee", "Lee", "Lee")
Contains_First_Name <- c("Kimberly Nicole", "Patrick L", "Johnson Ann", "Rick", "McAdams Jennifer Marie", "Kirk", "Kirk B")
Account_Number <- c("123", "345", "678", "901", "234", "567", "890")

df1 <- data.frame(Last_Name, Contains_First_Name, Account_Number)

# Create df2
Last_Name <- c("Smith", "Jones", "Adams", "Lee", "Lee")
First_Name <- c("Kimberly", "Patrick", "Ann", "Jennifer", "Kirk")

df2 <- data.frame(Last_Name, First_Name)

Resulting dataframes:

> df1
  Last_Name    Contains_First_Name Account_Number
1     Smith        Kimberly Nicole            123
2     Jones              Patrick L            345
3     Adams            Johnson Ann            678
4    Rogers                   Rick            901
5       Lee McAdams Jennifer Marie            234
6       Lee                   Kirk            567
7       Lee                 Kirk B            890
> df2
  Last_Name First_Name
1     Smith   Kimberly
2     Jones    Patrick
3     Adams        Ann
4       Lee   Jennifer
5       Lee       Kirk

What I want to end up with:

> df3
  Last_Name    Contains_First_Name Account_Number First_Name
1     Smith        Kimberly Nicole            123 Kimberly
2     Jones              Patrick L            345 Patrick
3     Adams            Johnson Ann            678 Ann
4       Lee McAdams Jennifer Marie            234 Jennifer
5       Lee                   Kirk            567 Kirk
6       Lee                 Kirk B            890 Kirk

I tried this:

df3 <-
  filter(df1,
         Last_Name %in% df2$Last_Name,
         str_detect(Contains_First_Name, paste(df2$First_Name, collapse = "|")))

Getting the following error:

Error in match.arg(method) : 'arg' must be NULL or a character vector

I also explored fuzzyjoin library but could not figure out how to join on two variables with two different join types (exact and substring.) I saw a similar question which does not appear to have an answer: Merge two dataframes based on an exact match in one column and match within an error in another column in R. Any advice is greatly appreciated. Thank you.


Solution

  • I'd say you have two options: either use an equi-join on only the first column and filter later or use fuzzyjoin, as you described:

    # Approach 1: Match all, filter later
    
    inner_join(df1, df2, join_by(Last_Name), relationship = "many-to-many") |> 
      filter(str_detect(Contains_First_Name, First_Name))
    #> # A tibble: 6 × 4
    #>   Last_Name Contains_First_Name    Account_Number First_Name
    #>   <chr>     <chr>                  <chr>          <chr>     
    #> 1 Smith     Kimberly Nicole        123            Kimberly  
    #> 2 Jones     Patrick L              345            Patrick   
    #> 3 Adams     Johnson Ann            678            Ann       
    #> 4 Lee       McAdams Jennifer Marie 234            Jennifer  
    #> 5 Lee       Kirk                   567            Kirk      
    #> 6 Lee       Kirk B                 890            Kirk
    
    # Approach 2: fuzzyjoin
    
    fuzzyjoin::fuzzy_inner_join(
      df1,
      df2,
      by = c("Last_Name" = "Last_Name", "Contains_First_Name" = "First_Name"),
      match_fun = list(`==`, \(x, y) str_detect(x, y))
    ) |> 
      select(!Last_Name.y) |> 
      rename(Last_Name = Last_Name.x)
    #> # A tibble: 6 × 4
    #>   Last_Name Contains_First_Name    Account_Number First_Name
    #>   <chr>     <chr>                  <chr>          <chr>     
    #> 1 Smith     Kimberly Nicole        123            Kimberly  
    #> 2 Jones     Patrick L              345            Patrick   
    #> 3 Adams     Johnson Ann            678            Ann       
    #> 4 Lee       McAdams Jennifer Marie 234            Jennifer  
    #> 5 Lee       Kirk                   567            Kirk      
    #> 6 Lee       Kirk B                 890            Kirk
    

    Created on 2024-01-06 with reprex v2.0.2