Tags: python, r, fuzzy-comparison

Python or R context-aware fuzzy matching


I am trying to match two string columns containing food descriptions (foods1 and foods2). I applied an algorithm that weights word frequency so that less frequent words carry more weight, but it fails because it does not recognise the objects being described.

For instance, the foods1 item "bagel with raisins" gets matched to the foods2 item "salad with raisins" rather than to "bagel", because "raisins" is a less frequent word. However, a "bagel with raisins" is closer to being a "bagel" as an actual object than to a "salad with raisins".

Example in R:

foods1 <- c('bagel plain','bagel with raisins and olives', 'hamburger','bagel with olives','bagel with raisins')
foods1_id <- seq.int(1,length(foods1))

foods2 <- c('bagel','pizza','salad with raisins','tuna and olives')
foods2_id <- c(letters[1:length(foods2)])

require(fedmatch)
fuzzy_result <- merge_plus(data1 = data.frame(foods1_id,foods1, stringsAsFactors = F), 
                           data2 = data.frame(foods2_id,foods2, stringsAsFactors = F),
                           by.x = "foods1",
                           by.y = "foods2", match_type = "fuzzy",  
                           fuzzy_settings = build_fuzzy_settings(method = "wgt_jaccard", nthread = 2,maxDist = .75), 
                           unique_key_1 = "foods1_id",
                           unique_key_2 = "foods2_id")

In the results below, row 3 matches foods1 "bagel with raisins" to foods2 "salad with raisins". Likewise, the last row matches foods1 "bagel with raisins and olives" to foods2 "tuna and olives":

fuzzy_result
$matches
   foods2_id foods1_id                        foods1             foods2
1:         a         1                   bagel plain              bagel
2:         a         4             bagel with olives              bagel
3:         c         5            bagel with raisins salad with raisins
4:         d         2 bagel with raisins and olives    tuna and olives

Is there any fuzzy matching algorithm in R or Python that can understand what objects are being matched? [So that "bagel" is recognised as closer to "bagel with raisins" than "salad with raisins" is.]


Solution

  • To expand on my comment, you can try the NLP concept of word embeddings, which are vector (numeric) representations of a word or sentence. Simply put, embeddings are generated in a way that roughly captures semantic meaning, so similar words end up close together in the vector space.

    For a small dataset like yours it will probably be overkill, but after generating the embeddings you can use cosine similarity to find which food items are closest to each other.
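    For intuition, the cosine similarity of two embedding vectors is just their dot product divided by the product of their norms. A minimal NumPy sketch, using made-up 2-D vectors in place of real embeddings:

    ```python
    import numpy as np

    def cosine_sim(a, b):
        # cosine similarity = dot(a, b) / (||a|| * ||b||), in [-1, 1]
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # toy 2-D vectors standing in for real embedding vectors
    a = np.array([1.0, 0.0])
    b = np.array([1.0, 1.0])
    print(round(cosine_sim(a, b), 4))  # vectors 45 degrees apart -> ~0.7071
    ```

    A similarity of 1 means the vectors point in the same direction, i.e. the embeddings are (near-)identical in meaning.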

    There are many pre-trained models out there that you can use, though you might have to research a little to find which one is most suitable for your use case (you can also fine-tune a model on your own data, but that's another story).

    See an unoptimized Python implementation below:

    # init
    # !pip install -U sentence-transformers
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity
    import pandas as pd
    import numpy as np
    
    sentences1 = ['bagel plain', 'bagel with raisins and olives', 'hamburger', 'bagel with olives', 'bagel with raisins', 'bagel']
    sentences2 = ['bagel', 'pizza', 'salad with raisins', 'tuna and olives']
    sentences = list(set(sentences1 + sentences2))  # deduplicate the sentences
    
    # Initialize the model
    model = SentenceTransformer('all-MiniLM-L6-v2')  # try different models
    
    # Create an embedding for each sentence
    embeddings = model.encode(sentences)
    
    # For each sentence in sentences1, compute the cosine similarity against
    # every sentence in sentences2 and keep the match with the highest score:
    indices1 = [sentences.index(s) for s in sentences1]
    indices2 = [sentences.index(s) for s in sentences2]
    emb1, emb2 = embeddings[indices1], embeddings[indices2]
    
    arr_cos, arr_sent = [], []
    for i in range(len(sentences1)):
        cos = cosine_similarity(emb1[i].reshape(1, -1), emb2).flatten()
        idx = np.argmax(cos)
        arr_cos.append(cos[idx])
        arr_sent.append(sentences2[idx])
    
    print(pd.DataFrame({'sent1': sentences1, 'paired': arr_sent, 'cosine': arr_cos}))
    

    Output:

                               sent1              paired    cosine
    0                    bagel plain               bagel  0.808948
    1  bagel with raisins and olives  salad with raisins  0.638765
    2                      hamburger               pizza  0.437424
    3              bagel with olives               bagel  0.686805
    4             bagel with raisins               bagel  0.707621
    5                          bagel               bagel  1.000000
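    As a usage note, the per-row loop can also be collapsed into a single `cosine_similarity` call, since scikit-learn accepts two matrices and returns the full pairwise similarity matrix. Sketched here with random stand-in embeddings (in practice these would come from `model.encode`):

    ```python
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # random stand-in embeddings: 6 items vs 4 items, 384-dim like MiniLM
    rng = np.random.default_rng(0)
    emb1 = rng.normal(size=(6, 384))
    emb2 = rng.normal(size=(4, 384))

    # full 6x4 pairwise similarity matrix in one call
    sim = cosine_similarity(emb1, emb2)
    best_idx = sim.argmax(axis=1)  # best match in set 2 for each item in set 1
    best_cos = sim.max(axis=1)     # the corresponding similarity score
    print(sim.shape, best_idx.shape)  # (6, 4) (6,)
    ```

    This avoids repeated Python-level calls and scales better when the lists grow.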