Tags: python, nlp, cosine-similarity

Is there a way to vectorize individual words in Python, i.e., not from a corpus or bag of words?


My use case is to vectorize words in two lists like below.

ListA = ['Japan', 'Electronics', 'Manufacturing', 'Science']

ListB = ['China', 'Electronics', 'AI', 'Software', 'Science']

I understand that word2vec and GloVe can vectorize words, but they do so through a corpus or bag of words, i.e., we have to pass sentences, which get broken down into tokens and then vectorized.

Is there a way to just vectorize words in a list?

PS. I am new to the NLP side of things, so pardon any obvious points.


Solution

  • I am assuming you wish to see the top 3 most similar words in ListA for each word in ListB. If so, here is a solution (and if you want each word in ListB compared against all words from both lists, I added an optional line for that too):

    import spacy
    
    ListA = ['Japan', 'Electronics', 'Manufacturing', 'Science']
    ListB = ['China', 'Electronics', 'AI', 'Software', 'Science']
    
    nlp = spacy.load('en_core_web_md')
    tokensA = nlp(' '.join(ListA))
    # use this instead to compare ListB against all tokens present:
    # tokensA = nlp(' '.join(ListA + ListB))
    tokensB = nlp(' '.join(ListB))
    
    output_mapping = {tokenB.text: [] for tokenB in tokensB}
    for tokenB in tokensB:
        for tokenA in tokensA:
            # record a (word, similarity) tuple for every pair
            output_mapping[tokenB.text].append((tokenA.text, tokenB.similarity(tokenA)))
        # sort once per word in ListB, most similar first
        output_mapping[tokenB.text].sort(key=lambda x: x[1], reverse=True)
    
    for word in sorted(output_mapping):
        # print each word from ListB with its top 3 matches, sorted by similarity
        print(word, output_mapping[word][:3])
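Under the hood, spaCy's `Token.similarity` is just the cosine similarity between the two tokens' word vectors. A minimal sketch of that computation with NumPy, using made-up 3-dimensional toy vectors in place of spaCy's real 300-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two vectors: u.v / (|u| * |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy vectors standing in for real word embeddings (values are illustrative only)
vec_electronics = np.array([0.9, 0.1, 0.3])
vec_software    = np.array([0.8, 0.2, 0.4])
vec_japan       = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(vec_electronics, vec_software))  # close to 1.0
print(cosine_similarity(vec_electronics, vec_japan))     # noticeably lower
```

With real embeddings, related words like 'Electronics' and 'Software' should score higher than unrelated pairs, which is exactly what the ranking loop above relies on.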