I would like to extract keywords from a sentence given a list_of_keywords.
I managed to extract the exact matches with:
[word for word in Sentence.split() if word in set(list_of_keywords)]
Is it possible to also extract words that are merely similar to one in list_of_keywords, i.e. words whose cosine similarity with a keyword is > 0.8?
For example, the keyword in the given list is 'allergy' and the sentence is
'a severe allergic reaction to nuts in the meal she had consumed.'
The cosine similarity between 'allergy' and 'allergic' can be calculated as below:
cosdis(word2vec('allergy'), word2vec('allergic'))
Out[861]: 0.8432740427115677
How can I extract 'allergic' from the sentence as well, based on the cosine similarity?
from collections import Counter
from math import sqrt

def word2vec(word):
    # count the characters in word
    cw = Counter(word)
    # precompute the set of distinct characters
    sw = set(cw)
    # precompute the "length" (Euclidean norm) of the word vector
    lw = sqrt(sum(c*c for c in cw.values()))
    # return a tuple
    return cw, sw, lw

def cosdis(v1, v2):
    # which characters are common to the two words?
    common = v1[1].intersection(v2[1])
    # by definition of cosine similarity: dot product divided by both norms
    return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]

list_of_keywords = ['allergy', 'something']
Sentence = 'a severe allergic reaction to nuts in the meal she had consumed.'
threshold = 0.80
for key in list_of_keywords:
    for word in Sentence.split():
        # compare every word of the sentence against every keyword
        res = cosdis(word2vec(word), word2vec(key))
        if res > threshold:
            print("Found a word with cosine similarity > {}: {} with original word: {}".format(threshold, word, key))
OUTPUT:
Found a word with cosine similarity > 0.8: allergic with original word: allergy
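If you need this in more than one place, the loop above could be wrapped in a function. The following is only a sketch under a few assumptions of mine: the name extract_similar is hypothetical, and stripping punctuation and lowercasing each word are additions that the original loop does not perform. It also computes the keyword vectors once instead of on every comparison.

```python
from collections import Counter
from math import sqrt
import string


def word2vec(word):
    """Character-count vector: (counts, distinct characters, Euclidean norm)."""
    cw = Counter(word)
    return cw, set(cw), sqrt(sum(c * c for c in cw.values()))


def cosdis(v1, v2):
    """Cosine similarity of two vectors produced by word2vec."""
    common = v1[1] & v2[1]
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]


def extract_similar(sentence, keywords, threshold=0.8):
    """Hypothetical helper: return (word, keyword, score) for every match above threshold."""
    # Build each keyword vector once, not once per sentence word.
    key_vectors = [(key, word2vec(key)) for key in keywords]
    matches = []
    for raw in sentence.split():
        # Assumption: trailing punctuation and letter case should not affect the match.
        word = raw.strip(string.punctuation).lower()
        if not word:
            continue  # the token was punctuation only
        vec = word2vec(word)
        for key, key_vec in key_vectors:
            score = cosdis(vec, key_vec)
            if score > threshold:
                matches.append((word, key, score))
    return matches


matches = extract_similar(
    'a severe allergic reaction to nuts in the meal she had consumed.',
    ['allergy', 'something'],
)
print([word for word, key, score in matches])  # ['allergic']
```

Returning the keyword and the score together with the word makes it easier to see why a word matched and to tune the threshold.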
EDIT:
The same thing as a one-liner:
print([x for x in Sentence.split() for y in list_of_keywords if cosdis(word2vec(x), word2vec(y)) > 0.8])
OUTPUT:
['allergic']
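One caveat worth checking before relying on this: word2vec here only counts characters, so the score measures spelling overlap and not meaning. The small check below is a sketch that redefines the two functions so it runs on its own; the word 'gallery' is my own example, chosen because it is an anagram of 'allergy'.

```python
from collections import Counter
from math import sqrt, isclose


def word2vec(word):
    """Character-count vector: (counts, distinct characters, Euclidean norm)."""
    cw = Counter(word)
    return cw, set(cw), sqrt(sum(c * c for c in cw.values()))


def cosdis(v1, v2):
    """Cosine similarity of two vectors produced by word2vec."""
    common = v1[1] & v2[1]
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]


# Anagrams have identical character counts, so they score as a perfect match.
anagram_score = cosdis(word2vec('gallery'), word2vec('allergy'))
related_score = cosdis(word2vec('allergic'), word2vec('allergy'))

print(isclose(anagram_score, 1.0))    # True: 'gallery' looks identical to 'allergy'
print(related_score > 0.8)            # True: 'allergic' passes the threshold
print(anagram_score > related_score)  # True: the unrelated anagram ranks higher
```

So with the 0.8 threshold, a sentence containing 'gallery' would also be reported for the keyword 'allergy'. If that kind of false positive matters for your data, you may want to combine this score with another check.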