I'm a PhD student in digital humanities and quite new to programming.
I have a problem that has been freaking me out for the last month: I'm trying to visualize a doc2vec model (Python, gensim library) with the embedding projector in TensorBoard, but I'm not getting what I expect.
I'm sure I'm missing something really basic here, so, to sum up:
Here is the code I'm using to generate the embeddings. I also tried changing the code a bit, taking the vectors directly from KeyedVectors, but nothing changed.
from gensim.scripts import word2vec2tensor
from gensim.models.doc2vec import Doc2Vec

# Load the trained model and export only the document (doctag) vectors,
# without the word vectors, in word2vec text format
doc2vec_model = Doc2Vec.load("doc2vec4.d2v")
doc2vec_model.save_word2vec_format('doc_tensor.w2v', doctag_vec=True, word_vec=False)

# Convert the export into the TSV files the TensorBoard projector expects
%run "C:..word2vec2tensor.py" -i doc_tensor.w2v -o my_plot
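The KeyedVectors variant I tried looked roughly like this (gensim 3.x API, same files as above):

from gensim.models.doc2vec import Doc2Vec

# Export the doctag vectors through the KeyedVectors object instead
doc2vec_model = Doc2Vec.load("doc2vec4.d2v")
doc2vec_model.docvecs.save_word2vec_format('doc_tensor.w2v')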
What am I doing wrong here? Thanks in advance.
Cosine distance is defined as 1 - cosine_similarity. Since cosine_similarity lies in the interval [-1, 1], cosine_distance lies in [0, 2]. It is therefore normal that some distances are greater than 1: this happens whenever the cosine similarity is negative, i.e. for vectors pointing in roughly opposite directions.
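A quick sketch with scipy and toy vectors, just to illustrate the range:

import numpy as np
from scipy.spatial.distance import cosine  # computes 1 - cosine_similarity

a = np.array([1.0, 0.0])

print(cosine(a, np.array([1.0, 0.0])))   # 0.0 -> same direction
print(cosine(a, np.array([0.0, 1.0])))   # 1.0 -> orthogonal
print(cosine(a, np.array([-1.0, 0.0])))  # 2.0 -> opposite direction, > 1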
As for your first question: since the explained variance of the PCA in your link is only ~8.5%, it is likely that the dimensionality reduction changes the neighbours of a given vector. You may want to try reducing the dimensionality of your model itself (e.g. training with a smaller vector size), so that the projection discards less information. Without more information on what your model is, it is hard to be more specific.
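If you want to check this on your own vectors, here is a rough way to measure how much variance a 3-component PCA keeps, assuming scikit-learn is available (the random matrix is just a placeholder for your doc vectors):

import numpy as np
from sklearn.decomposition import PCA

# Placeholder: replace with your (n_docs, vector_size) doc2vec matrix,
# e.g. doc2vec_model.docvecs.vectors_docs in gensim 3.x
vectors = np.random.rand(1000, 300)

pca = PCA(n_components=3)  # the projector's PCA view keeps 3 components
pca.fit(vectors)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained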