Tags: python, data-visualization, tensorboard, cosine-similarity, doc2vec

TensorBoard embedding visualization: what is cosine distance?


I'm a PhD student in digital humanities, and I'm quite new to programming.

I have a problem that has been freaking me out for the past month. I'm trying to visualize a doc2vec model (Python, gensim library) in TensorBoard's Embedding Projector, but I'm not getting what I expect.

I'm sure I'm missing something really basic here... anyway, to sum up:

  1. If I pick a random vector in TensorBoard, the most similar vectors are completely different from the ones my model returns. Is that because of the dimensionality reduction, or something else?
  2. A lot of vectors have a cosine similarity higher than 1, and I really don't understand what I'm doing wrong. Someone told me my vectors might not be normalized, but I thought gensim already does that, doesn't it? (See the quick check after this list.)
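
A quick way to check the normalization question directly is below (a minimal sketch, assuming gensim >= 4, where the document vectors live in model.dv; on older versions the attribute is model.docvecs):

import numpy as np
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("doc2vec4.d2v")
vectors = model.dv.vectors               # raw document vectors (model.docvecs on gensim < 4)
norms = np.linalg.norm(vectors, axis=1)
print(norms.min(), norms.max())          # values far from 1.0 -> raw vectors are not unit length

# gensim normalizes internally for similarity queries, but the vectors
# it stores (and exports) are the raw, un-normalized ones
print(model.dv.most_similar(model.dv.index_to_key[0], topn=5))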

Here is the code I'm using to generate the embeddings. I also tried changing the code a bit, taking the vectors directly from KeyedVectors, but nothing changed.

from gensim.scripts import word2vec2tensor
from gensim.models.doc2vec import Doc2Vec

# load the trained model and export only the document vectors
# (no word vectors) in word2vec text format
doc2vec_model = Doc2Vec.load("doc2vec4.d2v")
doc2vec_model.save_word2vec_format('doc_tensor.w2v', doctag_vec=True, word_vec=False)

# convert the exported vectors to the TSV files TensorBoard expects
%run "C:..word2vec2tensor.py" -i doc_tensor.w2v -o my_plot
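
(Equivalently, the converter can be called as a plain function instead of via %run; this assumes the word2vec2tensor function that ships in gensim.scripts.word2vec2tensor, which is what the script wraps:)

from gensim.scripts.word2vec2tensor import word2vec2tensor
# writes my_plot_tensor.tsv and my_plot_metadata.tsv for the projector
word2vec2tensor('doc_tensor.w2v', 'my_plot')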

What am I doing wrong here? Thanks in advance.


Solution

  • Cosine distance is defined as 1 - cosine_similarity. Since cosine similarity lies in the interval [-1, 1], cosine distance lies in [0, 2]. It is therefore normal for some distances to be greater than 1: this happens whenever two vectors point in sufficiently different directions, i.e. whenever their cosine similarity is negative.
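
    A quick numeric sketch of the arithmetic (plain numpy, independent of gensim or TensorBoard):

      import numpy as np

      def cosine_similarity(a, b):
          return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

      a = np.array([1.0, 0.0])
      b = np.array([-1.0, -0.2])       # points in roughly the opposite direction

      sim = cosine_similarity(a, b)    # ~ -0.98, within [-1, 1]
      print(1.0 - sim)                 # cosine distance ~ 1.98, within [0, 2]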

    As for your first question: since, in your link, the explained variance of the PCA is ~8.5%, it is likely that the dimensionality reduction changes the neighbours of a given vector. You may also want to try reducing the dimensionality in your model itself. Without more information about your model, it is hard to be more specific.
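
    If you want to measure that effect yourself, the snippet below reports how much variance a 3-component PCA (the projector's default 3-D view) retains. It is only a sketch, assuming scikit-learn and gensim >= 4 (use model.docvecs.vectors_docs on older versions):

      from gensim.models.doc2vec import Doc2Vec
      from sklearn.decomposition import PCA

      model = Doc2Vec.load("doc2vec4.d2v")
      vectors = model.dv.vectors    # model.docvecs.vectors_docs on gensim < 4

      pca = PCA(n_components=3)     # mirrors the projector's 3-D PCA view
      pca.fit(vectors)
      # a low total (e.g. ~0.085 as in your case) means PCA neighbours can
      # differ a lot from neighbours in the original high-dimensional space
      print(pca.explained_variance_ratio_.sum())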