I computed a matrix of word vectors manually using Keras. It looks like this:
>>> word_embeddings
0 1 2 3
movie 0.007964 0.004251 -0.049078 0.032954 ...
film -0.006703 0.045888 -0.020975 0.012483 ...
one -0.011733 0.003348 -0.022017 -0.006476 ...
make 0.045888 -0.011219 0.037796 -0.041868 ...
1000 rows × 25 columns
What I want now is to get the n most similar words to a given input word, e.g. input='movie' -> output=['film', 'cinema', ...]
I computed a matrix of Euclidean distances, but how do I get the above result from it?
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> distance_matrix = euclidean_distances(word_embeddings)
array([[0. , 2.4705646, 2.363872 , ..., 3.1345532, 2.9737253,
2.791427 ],
[2.4705646, 0. , 2.3540049, ..., 3.6580865, 3.4589343,
3.494087 ],
[2.363872 , 2.3540049, 0. , ..., 3.9583569, 3.692863 ,
3.5237448],
...,
[3.1345532, 3.6580865, 3.9583569, ..., 0. , 4.0572405,
4.0648513],
[2.9737253, 3.4589343, 3.692863 , ..., 4.0572405, 0. ,
4.156624 ],
[2.791427 , 3.494087 , 3.5237448, ..., 4.0648513, 4.156624 ,
0. ]], dtype=float32)
1000 rows × 1000 columns
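Try this:
import numpy as np
top_k_similar_indexes = np.argsort(distance_matrix, axis=1)[:, 1:k + 1]
Row i then holds the indexes of the k words most similar to word i, nearest first. The slice starts at 1 because the first entry of each sorted row is normally the word itself (distance 0 on the diagonal). If you want the indexes of the k most different words instead, use np.argsort(distance_matrix, axis=1)[:, -k:]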
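To go from indexes back to words for a single query, something like the sketch below might work. It assumes the row labels are available as a list in the same order as the rows of distance_matrix (for a pandas DataFrame that would be list(word_embeddings.index)). The function name most_similar and the toy data are made up for illustration and are not part of the question.

```python
import numpy as np

def most_similar(word, words, distance_matrix, n=3):
    """Return the n words closest to `word`, excluding the word itself."""
    words = list(words)
    row = words.index(word)                    # position of the query word
    order = np.argsort(distance_matrix[row])   # nearest first
    order = order[order != row][:n]            # drop the query word, keep n
    return [words[i] for i in order]

# Toy data standing in for the real 1000 x 25 embedding matrix (hypothetical values)
toy_words = ["movie", "film", "one", "make"]
toy_vectors = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [5.0, 5.0]])

# Pairwise Euclidean distances via broadcasting, shape (4, 4)
toy_distances = np.linalg.norm(
    toy_vectors[:, None, :] - toy_vectors[None, :, :], axis=-1
)

result = most_similar("movie", toy_words, toy_distances, n=2)
print(result)  # ['film', 'one']
```

With the real data the call would presumably look like most_similar('movie', list(word_embeddings.index), distance_matrix, n=5). Filtering on the row index rather than skipping the first sorted entry keeps the query word out of the result even when another word happens to have an identical vector.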