My use case is to vectorize the words in two lists, like the ones below.

ListA = ['Japan', 'Electronics', 'Manufacturing', 'Science']
ListB = ['China', 'Electronics', 'AI', 'Software', 'Science']
I understand that word2vec and GloVe can vectorize words, but they do so through a corpus or bag of words, i.e. we have to pass in sentences, which get broken down into tokens that are then vectorized.
Is there a way to just vectorize words in a list?
PS: I am new to the NLP side of things, so pardon any obvious points.
I am assuming you wish to see the top 3 most similar words in ListA for each word in ListB. If so, here is a solution (and if you want each word in ListB compared against the words of both lists instead, I added an optional line for that too):
import spacy

nlp = spacy.load('en_core_web_md')

ListA = ['Japan', 'Electronics', 'Manufacturing', 'Science']
ListB = ['China', 'Electronics', 'AI', 'Software', 'Science']

tokensA = nlp(' '.join(ListA))
# use this instead if you want the words in ListB compared to all tokens present:
# tokensA = nlp(' '.join(ListA + ListB))
tokensB = nlp(' '.join(ListB))

output_mapping = {tokenB.text: [] for tokenB in tokensB}
for tokenB in tokensB:
    for tokenA in tokensA:
        # record (word from ListA, similarity to the current ListB word)
        output_mapping[tokenB.text].append((tokenA.text, tokenB.similarity(tokenA)))
    # sort this word's pairs by similarity, highest first
    output_mapping[tokenB.text].sort(key=lambda x: x[1], reverse=True)

for word in sorted(output_mapping.keys()):
    # print the word from ListB and its top 3 similarities to ListA
    print(word, output_mapping[word][:3])
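As a side note on the original question: en_core_web_md ships with pretrained word vectors, so you can also pull the raw vector for each word directly, with no sentences or corpus needed on your end. Here is a minimal sketch (the has_vector check is there because out-of-vocabulary words come back as all-zero vectors):

import spacy

nlp = spacy.load('en_core_web_md')

for word in ['Japan', 'Electronics', 'Manufacturing', 'Science']:
    token = nlp(word)[0]
    if token.has_vector:
        # token.vector is a fixed-length numpy array (300 dimensions in this model)
        print(word, token.vector.shape)
    else:
        # the word is out of vocabulary for this model
        print(word, 'has no vector in this model')

So vectorizing a bare list of words is fine; the corpus/sentence step you saw with word2vec and GloVe is only needed for training vectors, not for looking them up in a pretrained model.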