Search code examples
rdplyrstringdist

Merging two dataframes by stringmatch with dplyr and stringdist


I'm attempting to do a dplyr left join on two dataframes based on greatly similar language (that's not exact).

DF1:

title | records
Bob's show, part 1 | 42
Time for dinner | 77
Horsecrap | 121

DF2:

showname | counts
Bob's show part 1 | 772
Dinner time | 89
No way Jose | 123

I execute this to get the string distances as a vector using the stringdist package/library:

titlematch <- amatch(df1$title,df2$showname) 

The vector looks like... well, an integer vector:

titlematch
1
2
NA

Normally if I had exact matches I'd do:

blended <- left_join(df1, df2, by = c("title" = "showname"))

How do I do the left join using the vector as the record selector so that the end outcome would be:

title | records | showname | counts
Bob's show, part 1 | 42 | Bob's show part 1 | 772
Time for dinner | 77 | Dinner time | 89

With the third non-match excluded because there's not a probable match in the vector (NA).


Solution

  • camille suggested in a comment:

    Have you looked at fuzzyjoin?

    I had never heard of fuzzyjoin before, but I tried it and loved it. stringdist_left_join was exactly what I needed.