Table joins with conditional "fuzzy" string matching in R

I'm trying to join two tables. The smaller one has a column of common food names (e.g. "Corn", "Peppers", "Squash"), and the larger one has more specific food names (e.g. "Sweet Corn", "Red Corn", "Baby Corn", "Zucchini Squash", "Orange Squash", "Squash, Large"). I want to join them on a "fuzzy" match between the food names and, for each name in the smaller table, pull the most frequent code (the mode) from another column of the larger table into a new column in the smaller table.

Here is an example of the smaller table:

Food Name	Food Code
Corn	NA
Squash	NA
Peppers	NA

Here is an example of the larger table:

Food Name	Food Code
Sweet Corn	532
Red Corn	532
Baby Corns	944
Squash	111
Long Squash	123
Red Pepper	654
Green Pepper	655
Red Peppers	654

I've tried the left_join function from the dplyr package, but it doesn't seem to handle "fuzzy" string matching well. I was also hoping to compute the mode per group with tidyverse functions, but I'm unsure how to combine that with the left_join call. I also came across the fuzzyjoin package, but I'm not certain whether it is the best option.

My desired output would look like:

Food Name	Food Code
Corn	532
Squash	111
Peppers	654

Solution

I hope this helps you.

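In stringdist_join, the max_dist argument sets the largest string distance at which two names still count as a match, so it controls how fuzzy the join is: a lower value gives stricter matching.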
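To get a feel for a sensible max_dist, it can help to look at the raw distances first. The following is a minimal sketch, assuming the stringdist package is installed (fuzzyjoin relies on it for these distances); the names `general`, `specific` and `dists` are made up for illustration. Lowercasing by hand stands in for ignore_case = TRUE.

```r
library(stringdist)

# Cosine distance on character q-grams (q = 1 by default) between one
# general name and a few specific names, compared case-insensitively
general  <- "corn"
specific <- tolower(c("Sweet Corn", "Red Corn", "Baby Corns", "Long Squash"))

dists <- stringdist(general, specific, method = "cosine")

# Pairs at or below the cutoff would be kept by a join with max_dist = 0.5
data.frame(specific, dist = round(dists, 3), kept = dists <= 0.5)
```

If unrelated names fall under the cutoff, lowering max_dist or trying another method (e.g. 'jw' or 'lv') may be worth a look.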

library(fuzzyjoin)
library(dplyr)
library(knitr)


small_tab = data.frame(Food.Name = c('Corn', 'Squash', 'Peppers'), 
                       Food.Code = c(NA, NA, NA))


large_tab = data.frame(Food.Name = c('Sweet Corn', 'Red Corn', 'Baby Corns', 
                                     'Squash', 'Long Squash', 'Red Pepper', 
                                     'Green Pepper', 'Red Peppers'), 
                       Food.Code = c(532, 532, 944, 111, 123, 654, 655, 654))

joined_tab = stringdist_join(small_tab, large_tab, by = 'Food.Name',
                             ignore_case = TRUE, method = 'cosine', 
                             max_dist = 0.5, distance_col = 'dist') %>%
  
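  # Keep the name from the small table and the code from the large table;
  # all other columns (Food.Code.x, Food.Name.y, dist) are dropped
  select(Food.Name = Food.Name.x, Food.Code = Food.Code.y) %>%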
  
  # Only keep the most frequent food code (the mode) per food name;
  # on a tie, which.max() keeps the first row
  group_by(Food.Name) %>% count(Food.Name, Food.Code) %>% 
  slice(which.max(n)) %>% select(-n) %>%
  
  # Order food names as in the small table
  arrange(factor(Food.Name, levels = small_tab$Food.Name))

# Show table with columns renamed
joined_tab %>%
  rename('Food Name' = Food.Name, 
         'Food Code' = Food.Code) %>%
  kable()

Food Name	Food Code
Corn	532
Squash	111
Peppers	654

Created on 2023-05-31 with reprex v2.0.2
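The "most frequent code" step is a per-group mode. As far as I know, dplyr has no built-in mode function, so here is a hedged sketch with a small helper. `stat_mode` is a hypothetical name, not part of any package, and `matches` is made-up data standing in for the output of the fuzzy join.

```r
library(dplyr)

# Hypothetical helper (not from dplyr): most frequent value of a vector;
# on a tie, the value that appears first wins
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Made-up matched pairs, shaped like the result of the fuzzy join
matches <- data.frame(
  Food.Name = c("Corn", "Corn", "Corn", "Peppers", "Peppers", "Peppers"),
  Food.Code = c(532, 532, 944, 654, 655, 654)
)

# One row per food name, holding its most frequent code
mode_tab <- matches %>%
  group_by(Food.Name) %>%
  summarise(Food.Code = stat_mode(Food.Code), .groups = "drop")
```

Note that the tie rule here (first value seen) may differ from the count() and which.max() approach above, so it is worth deciding explicitly how ties should be broken.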