python pandas keras word2vec euclidean-distance

Get most similar words for matrix of word vectors


I computed a matrix of word vectors manually using Keras. It looks like this:

>>> word_embeddings

        0           1           2           3 
movie   0.007964    0.004251    -0.049078   0.032954    ...
film    -0.006703   0.045888    -0.020975   0.012483    ...
one     -0.011733   0.003348    -0.022017   -0.006476   ...
make    0.045888    -0.011219   0.037796    -0.041868   ...

1000 rows × 25 columns

What I want now is to get the n most similar words to a given input word, e.g. input='movie' -> output=['film', 'cinema', ...]

I computed a matrix of Euclidean distances, but how do I get the above result from it?

>>> from sklearn.metrics.pairwise import euclidean_distances
>>> distance_matrix = euclidean_distances(word_embeddings)

array([[0.       , 2.4705646, 2.363872 , ..., 3.1345532, 2.9737253,
        2.791427 ],
       [2.4705646, 0.       , 2.3540049, ..., 3.6580865, 3.4589343,
        3.494087 ],
       [2.363872 , 2.3540049, 0.       , ..., 3.9583569, 3.692863 ,
        3.5237448],
       ...,
       [3.1345532, 3.6580865, 3.9583569, ..., 0.       , 4.0572405,
        4.0648513],
       [2.9737253, 3.4589343, 3.692863 , ..., 4.0572405, 0.       ,
        4.156624 ],
       [2.791427 , 3.494087 , 3.5237448, ..., 4.0648513, 4.156624 ,
        0.       ]], dtype=float32)

1000 rows × 1000 columns

Solution

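  • Try this:

    import numpy as np
    top_k_similar_indexes = np.argsort(distance_matrix, axis=1)[:, 1:k + 1]
    

    This gives you, for each row, the indexes of the k most similar words. The slice starts at 1 because every word is at distance 0 from itself, so the first sorted position is normally the word itself. If word_embeddings is a DataFrame indexed by word (as the printout suggests), word_embeddings.index maps those indexes back to words. If you want the indexes of the k most different words, use np.argsort(distance_matrix, axis=1)[:, -k:].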
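
The argsort approach returns row positions, while the question asks for words. Here is a minimal sketch of the lookup for a single input word. It assumes word_embeddings is a pandas DataFrame indexed by word. The toy values are made up, most_similar is a hypothetical helper name, and the distances are computed with NumPy broadcasting rather than sklearn only to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 1000 x 25 embedding matrix (values are made up).
word_embeddings = pd.DataFrame(
    [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.0, 0.3]],
    index=["movie", "film", "one", "make"],
)

# Pairwise Euclidean distances via broadcasting: entry [i, j] is the
# distance between row i and row j, with zeros on the diagonal.
vectors = word_embeddings.to_numpy()
distance_matrix = np.linalg.norm(
    vectors[:, None, :] - vectors[None, :, :], axis=-1
)


def most_similar(word, n, embeddings, distances):
    """Return the n words closest to `word` (hypothetical helper)."""
    row = embeddings.index.get_loc(word)   # row position of the query word
    order = np.argsort(distances[row])     # nearest first
    order = order[order != row][:n]        # drop the query word itself
    return embeddings.index[order].tolist()


result = most_similar("movie", 2, word_embeddings, distance_matrix)
print(result)  # ['film', 'make']
```

Filtering with order != row instead of slicing off the first element keeps the result correct even if another word has exactly the same vector as the input word.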