Tags: python, text-mining, fuzzy-search

Python: Check if the sentence contains any word from List (with fuzzy match)


I would like to extract keywords from a sentence given a list_of_keywords.

I managed to extract the exact matches with:

[word for word in Sentence.split() if word in set(list_of_keywords)]

Is it possible to also extract words that are sufficiently similar to an entry in list_of_keywords, i.e. words whose cosine similarity with a keyword is > 0.8?

For example, suppose the list contains the keyword 'allergy' and the sentence is:

'a severe allergic reaction to nuts in the meal she had consumed.'

The cosine similarity between 'allergy' and 'allergic' can be calculated as follows:

cosdis(word2vec('allergy'), word2vec('allergic'))
Out[861]: 0.8432740427115677

How can I also extract 'allergic' from the sentence, based on the cosine similarity?


Solution

    from collections import Counter
    from math import sqrt

    def word2vec(word):
        # NOTE: despite the name, this builds a character-count vector,
        # not a Word2Vec embedding
        # count the characters in the word
        cw = Counter(word)
        # precompute the set of distinct characters
        sw = set(cw)
        # precompute the Euclidean length (norm) of the count vector
        lw = sqrt(sum(c*c for c in cw.values()))

        # return a tuple: (counts, distinct characters, norm)
        return cw, sw, lw
    
    def cosdis(v1, v2):
        # which characters are common to the two words?
        common = v1[1].intersection(v2[1])
        # cosine similarity: dot product over the shared characters,
        # divided by the product of the two vector lengths
        return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]
    
    
    list_of_keywords = ['allergy', 'something']
    Sentence = 'a severe allergic reaction to nuts in the meal she had consumed.'
    
    threshold = 0.80
    for key in list_of_keywords:
        for word in Sentence.split():
            res = cosdis(word2vec(word), word2vec(key))
            if res > threshold:
                print("Found a word with cosine similarity > {}: {} with original word: {}".format(threshold, word, key))
    

    OUTPUT:

    Found a word with cosine similarity > 0.8: allergic with original word: allergy
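
To see where the score for 'allergy' and 'allergic' comes from, the calculation can be spelled out by hand. This is only an illustrative sketch; the variable names (a, b, dot, norm_a, norm_b) are not part of the code above.

```python
from collections import Counter
from math import sqrt

a = Counter("allergy")    # a:1, l:2, e:1, r:1, g:1, y:1
b = Counter("allergic")   # a:1, l:2, e:1, r:1, g:1, i:1, c:1

# Dot product over the characters both words share (a, l, e, r, g).
dot = sum(a[ch] * b[ch] for ch in a.keys() & b.keys())   # 1 + 4 + 1 + 1 + 1 = 8

# Euclidean norms of the two character-count vectors.
norm_a = sqrt(sum(n * n for n in a.values()))   # sqrt(9) = 3.0
norm_b = sqrt(sum(n * n for n in b.values()))   # sqrt(10)

similarity = dot / (norm_a * norm_b)
print(round(similarity, 4))   # 0.8433
```

Because only character counts are compared, letter order is ignored: two anagrams have identical count vectors and therefore score 1 (up to floating-point rounding), so this measure is best treated as a rough spelling-overlap heuristic.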
    

    EDIT:

    The same check as a one-line list comprehension:

    print([x for x in Sentence.split() for y in list_of_keywords if cosdis(word2vec(x), word2vec(y)) > 0.8])
    

    OUTPUT:

    ['allergic']
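
If this is needed in more than one place, the same idea could be wrapped in a function. The following is a minimal sketch, not part of the original answer: the names char_vector, cosine_similarity and fuzzy_keywords are hypothetical, and it assumes that case and surrounding punctuation should be ignored (the loop above compares 'consumed.' with the trailing period included). It also builds each keyword vector once rather than once per sentence word.

```python
import string
from collections import Counter
from math import sqrt


def char_vector(word):
    """Return (character counts, Euclidean norm) for a word."""
    counts = Counter(word)
    norm = sqrt(sum(n * n for n in counts.values()))
    return counts, norm


def cosine_similarity(v1, v2):
    """Cosine similarity of two vectors built by char_vector."""
    (c1, n1), (c2, n2) = v1, v2
    if n1 == 0 or n2 == 0:
        return 0.0  # an empty string has no direction to compare
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    return dot / (n1 * n2)


def fuzzy_keywords(sentence, keywords, threshold=0.8):
    """Return (word, keyword) pairs whose similarity exceeds threshold."""
    # Build each keyword vector once instead of once per sentence word.
    key_vectors = [(key, char_vector(key.lower())) for key in keywords]
    matches = []
    for token in sentence.split():
        # Assumption: case and surrounding punctuation should not affect the score.
        word = token.strip(string.punctuation).lower()
        if not word:
            continue
        vec = char_vector(word)
        for key, key_vec in key_vectors:
            if cosine_similarity(vec, key_vec) > threshold:
                matches.append((word, key))
    return matches


sentence = 'a severe allergic reaction to nuts in the meal she had consumed.'
print(fuzzy_keywords(sentence, ['allergy', 'something']))
# [('allergic', 'allergy')]
```

Returning (word, keyword) pairs instead of printing keeps track of which keyword triggered each match, which the one-liner above discards.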