I have two datasets both with different columns however both with names. I wanted to match the names and concatenate the rows to each other
df1 <- data.frame (Name = c("Smith, Carl", "Jones, Allan", "Waterman, Josh", "Woods, Alex"),
number1 = c(1, 2, 3, 4))
df2 <- data.frame(fullname = c("Carl Smith", "Allan Jones", "Josh Waterman", "Alex Woods"),
research = c("genetics", "genetics", "mathematics", "plant"))
I would like to combine the dataset into
Name | Number1 | Research | fullname |
---|---|---|---|
Smith, Carl | 1 | genetics | Carl Smith |
Jones, Allan | 2 | genetics | Allan Jones |
Waterman, Josh | 3 | mathematics | Josh Waterman |
Woods, Alex | 4 | plant | Alex Woods |
I have tried merge(df1, df2, by.x = "Name", by.y = "fullname", all = TRUE)
but it doesn't match the partial names, it gives:
Name | Number1 | Research | fullname |
---|---|---|---|
Smith, Carl | 1 | - | - |
Jones, Allan | 2 | - | - |
Waterman, Josh | 3 | - | - |
Woods, Alex | 4 | - | - |
- | - | genetics | Carl Smith |
- | - | genetics | Allan Jones |
- | - | mathematics | Josh Waterman |
- | - | plant | Alex Woods |
This could be solved by a fuzzy (=not needing perfectly matching conditions) join of both datasets. This is different from merge as it joins two tables based on fuzzy string matching. Beware that this does not always work perfectly and may produce incorrectly joined rows. If the join is producing errors, try manipulating the max_dist
argument.
library(dplyr)
library(fuzzyjoin)
df1 %>%
rename(fullname = Name) %>% #fuzzyjoin expects equal colnames
fuzzyjoin::stringdist_inner_join(df2,by = "fullname", method = "jaccard", max_dist=0.31)