Search code examples
python-3.xnumpygensimword2veccosine-similarity

Perform matrix multiplication with cosine similarity function


I have two lists:

list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
          ['scent', 'scents', 'aroma', 'smell', 'odor'],
          ['mental_illness', 'mental_disorders','bipolar_disorder']
          ['romance', 'romances', 'romantic', 'budding_romance']]

list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
          ['also', 'like', 'buy', 'perfumes'],
          ['suffer', 'from', 'clinical', 'depression'],
          ['really', 'love', 'my', 'wife']]

I would like to compute the cosine similarity between the two lists above in such a way where the cosine similarity between the first sub-list in list1 and all sublists of list 2 are measured against each other. Then the same thing but with the second sub-list in list 1 and all sub-lists in list 2, etc.

The goal is to create a len(list_2) by len(list_1) matrix, and each entry in that matrix is a cosine similarity score. Currently I've done this the following way:

import gensim
import numpy as np
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True) 
similarity_mat = np.zeros([len(list_2), len(list_1)])

for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)

However, I'd like to implement this with matrix multiplication and no for loops.

My two questions are:

  1. Is there a way to do some sort of element-wise matrix multiplication but with gensim's n_similiarity() method to generate the required matrix?
  2. Would it be more efficient and faster using the current method or matrix multiplication?

I hope my question was clear enough, please let me know if I can clarify even further.


Solution

  • There are two problems in code, the second last and last line.

    import gensim
    import numpy as np
    from gensim.models import KeyedVectors
    
    model = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True) 
    similarity_mat = np.zeros([len(list_2), len(list_1)])
    
    for i, L2 in enumerate(list_2):
        for j, L1 in enumerate(list_1):
            similarity_mat[i, j] = model.n_similarity(L2, L1)

    Answers to you questions:
    1. You are already using a direct function to calculate the similarity between two sentences(L1 and L2) which are first converted to two vectors and then cosine similarity is calculated of those two vectors. Everything is already done inside the n_similarity() so you can't do any kind of matrix multiplication.
    If you want to do your own matrix multiplication then instead of directly using n_similarity() calculates the vectors of the sentences and then apply matrix multiplication while calculating cosine similarity.
    2. As I said in (1) that everything is done in n_similarity() and creators of gensim takes care of the efficiency when writing the libraries so any other multiplication method will most likely not make a difference.