I am trying an NLP technique to measure the similarity between words in two lists.
My code is below:
import en_core_web_sm
nlp = en_core_web_sm.load()
Listalpha = ['Apple', 'Grapes', 'Mango', 'Fig','Orange']
ListBeta = ['Carrot', 'Mango', 'Tomato', 'Potato', 'Lemon']
list_n =" ".join(ListBeta)
doc = nlp(list_n)
list_str = " ".join(Listalpha)
doc2 = nlp(list_str)
newlist = []
for token1 in doc:
    for token2 in doc2:
        newlist.append((token1.text, token2.text, token1.similarity(token2)))
words_most_similar = sorted(newlist, key=lambda x: x[2], reverse=True)
print(words_most_similar)
I get the following output:
[('Mango', 'Mango', 1.0), ('Potato', 'Mango', 0.71168435), ('Lemon', 'Orange', 0.70560765), ('Carrot', 'Mango', 0.670182), ('Tomato', 'Mango', 0.6513121), ('Potato', 'Fig', 0.6306212), ('Tomato', 'Fig', 0.61672616), ('Carrot', 'Apple', 0.6077532), ('Lemon', 'Mango', 0.5978425), ('Mango', 'Fig', 0.5930651), ('Mango', 'Orange', 0.5529714), ('Potato', 'Apple', 0.5516073), ('Potato', 'Orange', 0.5486618), ('Lemon', 'Fig', 0.50294644), ('Mango', 'Apple', 0.48833746), ('Tomato', 'Orange', 0.44175738), ('Mango', 'Grapes', 0.42697987), ('Lemon', 'Apple', 0.42477235), ('Carrot', 'Fig', 0.3984716), ('Carrot', 'Grapes', 0.3944748), ('Potato', 'Grapes', 0.3860814), ('Tomato', 'Apple', 0.38342345), ('Carrot', 'Orange', 0.38251868), ('Tomato', 'Grapes', 0.3763761), ('Lemon', 'Grapes', 0.28998604)]
How do I get the output in the format below?
[('Mango','Mango',1.0),('Mango', 'Fig', 0.5930651), ('Mango', 'Orange', 0.5529714),('Mango', 'Apple', 0.48833746),('Mango', 'Grapes', 0.42697987),('Carrot', 'Mango', 0.670182),('Carrot', 'Apple', 0.6077532)....]
Basically, I want each entry to have the form (word in ListBeta, word in Listalpha, cosine score), grouped by the ListBeta word rather than interleaved as in my current output. Within each group, the entries should be in descending order of cosine score, as shown above.
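If this is indeed just a question of sorting the results, the key function you pass to sorted can return a tuple (or list), and Python will compare those tuples element-wise. In the line below, the first element groups the entries by the ListBeta word, and the negated score puts each group in descending order of similarity: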
words_most_similar = sorted(newlist, key=lambda t: (t[0], -t[2]))
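To see what the tuple key does without loading a model, here is a minimal sketch that runs on a few tuples copied from the output in the question. The names pairs, grouped, beta_order and grouped_in_list_order are made up for illustration. The second sort is an optional variant, assuming you would rather keep the groups in the order the words appear in ListBeta than in alphabetical order.

```python
# A few (ListBeta word, Listalpha word, score) tuples copied from the question's output.
pairs = [
    ("Mango", "Mango", 1.0),
    ("Potato", "Mango", 0.71168435),
    ("Lemon", "Orange", 0.70560765),
    ("Carrot", "Mango", 0.670182),
    ("Carrot", "Apple", 0.6077532),
    ("Mango", "Fig", 0.5930651),
    ("Potato", "Apple", 0.5516073),
]

# Tuple key: first word ascending (alphabetical), then score descending
# (the score is negated so that larger values sort first).
grouped = sorted(pairs, key=lambda t: (t[0], -t[2]))

# Variant (an assumption about the desired order): rank each word by its
# position in ListBeta instead of alphabetically.
ListBeta = ["Carrot", "Mango", "Tomato", "Potato", "Lemon"]
beta_order = {word: i for i, word in enumerate(ListBeta)}
grouped_in_list_order = sorted(pairs, key=lambda t: (beta_order[t[0]], -t[2]))

for row in grouped:
    print(row)
```

With the alphabetical key the Lemon entry lands between the Carrot and Mango groups. With the beta_order key it comes last, because Lemon is the last word in ListBeta.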