I am trying to match two string columns containing food descriptions (foods1 and foods2). I applied an algorithm that weights word frequency, so less frequent words carry more weight, but it fails because it does not recognise the objects being described.
For instance, the foods1 item "bagel with raisins" gets matched to the foods2 item "salad with raisins" rather than to "bagel", because "raisins" is a less frequent word. However, a "bagel with raisins" is closer to being a "bagel" as an actual object than to a "salad with raisins".
Example in R:
foods1 <- c('bagel plain', 'bagel with raisins and olives', 'hamburger',
            'bagel with olives', 'bagel with raisins')
foods1_id <- seq_along(foods1)
foods2 <- c('bagel', 'pizza', 'salad with raisins', 'tuna and olives')
foods2_id <- letters[1:length(foods2)]

require(fedmatch)
fuzzy_result <- merge_plus(
  data1 = data.frame(foods1_id, foods1, stringsAsFactors = FALSE),
  data2 = data.frame(foods2_id, foods2, stringsAsFactors = FALSE),
  by.x = "foods1",
  by.y = "foods2",
  match_type = "fuzzy",
  fuzzy_settings = build_fuzzy_settings(method = "wgt_jaccard",
                                        nthread = 2, maxDist = 0.75),
  unique_key_1 = "foods1_id",
  unique_key_2 = "foods2_id"
)
In the results below, see row 3, where the foods1 item "bagel with raisins" is matched to the foods2 item "salad with raisins". The same happens in the last row, where foods1 "bagel with raisins and olives" is matched to foods2 "tuna and olives":
fuzzy_result
$matches
   foods2_id foods1_id                        foods1             foods2
1:         a         1                   bagel plain              bagel
2:         a         4             bagel with olives              bagel
3:         c         5            bagel with raisins salad with raisins
4:         d         2 bagel with raisins and olives    tuna and olives
Is there any fuzzy matching algorithm in R or Python that can understand what objects are being matched, so that "bagel" is recognised as closer to "bagel with raisins" than "salad with raisins" is?
To expand on my comment, you can try the NLP concept of word embeddings, which are simply vector/numeric representations of a word or sentence. Simplified, embeddings are generated in a way that captures the semantic relationships between words, so similar words end up close together in the vector space.
For a small dataset like yours it will probably be overkill, but after generating the embeddings you can use cosine similarity to find which food items are closest to each other.
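(As a refresher, the cosine similarity of two embedding vectors is their dot product divided by the product of their norms, i.e. the cosine of the angle between them; a minimal numpy sketch with made-up 3-dimensional vectors:)

import numpy as np

a = np.array([0.2, 0.9, 0.1])  # hypothetical embedding of one phrase
b = np.array([0.3, 0.8, 0.0])  # hypothetical embedding of another phrase

# cosine similarity = dot(a, b) / (||a|| * ||b||); ranges from -1 to 1
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(float(cos_sim), 3))  # ~0.984: the vectors point in similar directions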
There are many pre-trained models out there that you can use, though you might have to research a little to find the one most suitable for your use case (you can also fine-tune a model on your own data, but that's another story).
See an unoptimized Python implementation below:
# init
# !pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

sentences1 = ['bagel plain', 'bagel with raisins and olives', 'hamburger',
              'bagel with olives', 'bagel with raisins', 'bagel']
sentences2 = ['bagel', 'pizza', 'salad with raisins', 'tuna and olives']
sentences = sentences1 + sentences2
sentences = list(set(sentences))  # deduplicate the sentences before encoding

# Initialize the model
model = SentenceTransformer('all-MiniLM-L6-v2')  # try different models

# Create embeddings for each sentence
embeddings = model.encode(sentences)

# Loop through each item in sentences1, compare its cosine similarity against
# every item in sentences2, and select the one with the highest similarity
indices1 = [sentences.index(i) for i in sentences1]
indices2 = [sentences.index(i) for i in sentences2]
emb1, emb2 = embeddings[indices1], embeddings[indices2]

arr_cos, arr_sent = [], []
for i in range(len(sentences1)):
    cos = cosine_similarity(emb1[i].reshape(1, -1), emb2).flatten()
    idx = np.argmax(cos)
    arr_cos.append(cos[idx])
    arr_sent.append(sentences2[idx])

print(pd.DataFrame({'sent1': sentences1, 'paired': arr_sent, 'cosine': arr_cos}))
Output:
                           sent1              paired    cosine
0                    bagel plain               bagel  0.808948
1  bagel with raisins and olives  salad with raisins  0.638765
2                      hamburger               pizza  0.437424
3              bagel with olives               bagel  0.686805
4             bagel with raisins               bagel  0.707621
5                          bagel               bagel  1.000000
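Note that row 1 of the output still pairs "bagel with raisins and olives" with "salad with raisins", so in practice you may want to try other models and/or only keep a pair when its cosine score clears a minimum threshold, leaving low-score items unmatched. As a side note, the per-row loop above can be replaced by a single vectorized call; here is a sketch using the util.cos_sim helper that ships with sentence-transformers (same model and lists as above):

from sentence_transformers import SentenceTransformer, util
import pandas as pd

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences1 = ['bagel plain', 'bagel with raisins and olives', 'hamburger',
              'bagel with olives', 'bagel with raisins', 'bagel']
sentences2 = ['bagel', 'pizza', 'salad with raisins', 'tuna and olives']

# Encode each list directly; no manual index bookkeeping needed
emb1 = model.encode(sentences1)
emb2 = model.encode(sentences2)

# util.cos_sim returns a (len(sentences1), len(sentences2)) tensor of
# pairwise cosine similarities
cos = util.cos_sim(emb1, emb2)
best = cos.argmax(dim=1).tolist()  # best sentences2 index per sentences1 item

print(pd.DataFrame({
    'sent1': sentences1,
    'paired': [sentences2[j] for j in best],
    'cosine': [cos[i, j].item() for i, j in enumerate(best)],
}))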