I'm attempting to do a dplyr left join on two dataframes based on greatly similar language (that's not exact).
DF1:
title | records
Bob's show, part 1 | 42
Time for dinner | 77
Horsecrap | 121
DF2:
showname | counts
Bob's show part 1 | 772
Dinner time | 89
No way Jose | 123
I execute this to get the string distances as a vector using the stringdist package/library:
titlematch <- amatch(df1$title,df2$showname)
The vector looks like... well, an integer vector:
titlematch
1
2
NA
Normally if I had exact matches I'd do:
blended <- left_join(df1, df2, by = c("title" = "showname"))
How do I do the left join using the vector as the record selector so that the end outcome would be:
title | records | showname | counts
Bob's show, part 1 | 42 | Bob's show part 1 | 772
Time for dinner | 77 | Dinner time | 89
With the third non-match excluded because there's not a probable match in the vector (NA).
camille suggested in a comment:
Have you looked at
fuzzyjoin
?
I had never heard of fuzzyjoin
before, but I tried it and loved it. stringdist_left_join
was exactly what I needed.