Search code examples
rstringmergepartial

Merge two datasets by partial string - R


I have two datasets both with different columns however both with names. I wanted to match the names and concatenate the rows to each other

df1 <- data.frame (Name = c("Smith, Carl", "Jones, Allan", "Waterman, Josh", "Woods, Alex"),
 number1 = c(1, 2, 3, 4))

df2 <- data.frame(fullname = c("Carl Smith", "Allan Jones", "Josh Waterman", "Alex Woods"),
research = c("genetics", "genetics", "mathematics", "plant"))

I would like to combine the dataset into

Name Number1 Research fullname
Smith, Carl 1 genetics Carl Smith
Jones, Allan 2 genetics Allan Jones
Waterman, Josh 3 mathematics Josh Waterman
Woods, Alex 4 plant Alex Woods

I have tried merge(df1, df2, by.x = "Name", by.y = "fullname", all = TRUE) but it doesn't match the partial names, it gives:

Name Number1 Research fullname
Smith, Carl 1 - -
Jones, Allan 2 - -
Waterman, Josh 3 - -
Woods, Alex 4 - -
- - genetics Carl Smith
- - genetics Allan Jones
- - mathematics Josh Waterman
- - plant Alex Woods

Solution

  • This could be solved by a fuzzy (=not needing perfectly matching conditions) join of both datasets. This is different from merge as it joins two tables based on fuzzy string matching. Beware that this does not always work perfectly and may produce incorrectly joined rows. If the join is producing errors, try manipulating the max_dist argument.

    library(dplyr)
    library(fuzzyjoin)
    
    df1 %>%
    rename(fullname = Name) %>% #fuzzyjoin expects equal colnames
    fuzzyjoin::stringdist_inner_join(df2,by = "fullname", method = "jaccard", max_dist=0.31)