
Word shows up more than once in TSNE plot


When plotting word embedding TSNE results, words show up more than once.

I am reducing the dimensionality of a Word2Vec word embedding, but when I plot the results for a subset of the most similar words (I manually enter several words and collect the most similar ones for each), the same words show up more than once:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Flatten {query_word: [similar_word, ...]} into a single list of labels
words = sum([[k] + v for k, v in similar_words.items()], [])
wvs = model.wv[words]

# n_iter is named max_iter in recent scikit-learn releases
tsne = TSNE(n_components=3, random_state=0, n_iter=10000, perplexity=29)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(wvs)
labels = words

# Plot only the first two of the three t-SNE components
plt.figure(figsize=(16, 12))
plt.scatter(T[:, 0], T[:, 1], c='purple', edgecolors='purple')
for label, x, y in zip(labels, T[:, 0], T[:, 1]):
    plt.annotate(label, xy=(x + 1, y + 1), xytext=(0, 0), textcoords='offset points')

Is this normal behavior for PCA and TSNE dimensionality reduction of word similarities, or is there something off with my code? Is it possible that the plot is treating each of the similar-word subsets as independent of each other?
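
One quick check worth making (a hypothetical diagnostic, not part of the original code): the flattening step keeps every occurrence, so a neighbor shared by two query words appears twice in the words list and is therefore plotted and labeled twice. A minimal sketch to count and drop such duplicates:

from collections import Counter

# Hypothetical diagnostic: report labels occurring more than once in words
dupes = [w for w, n in Counter(words).items() if n > 1]
print(len(dupes), "labels appear more than once:", dupes[:10])

# Order-preserving dedup before looking up the vectors
seen = set()
unique_words = [w for w in words if not (w in seen or seen.add(w))]
wvs = model.wv[unique_words]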


Solution

  • Each word has two vectors: one as a center word and one as a context word. See the Stanford University word2vec lecture starting at 41:37; the sketch below illustrates the two tables in gensim.
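
To make the two-vector point concrete, here is a minimal sketch (assuming gensim 4.x with the negative-sampling objective; the toy corpus and variable names are made up for illustration). The center/input vectors live in model.wv, while the context/output vectors live in model.syn1neg:

from gensim.models import Word2Vec

# Toy corpus, purely illustrative
sentences = [["cat", "sat", "on", "the", "mat"],
             ["dog", "sat", "on", "the", "rug"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, sg=1, negative=5, seed=0)

# Center (input) vector for "sat"
center_vec = model.wv["sat"]
# Context (output) vector for "sat", from the negative-sampling output layer
context_vec = model.syn1neg[model.wv.key_to_index["sat"]]

print((center_vec == context_vec).all())  # False: the two representations differ

Note that with hierarchical softmax instead of negative sampling, gensim stores the output table under model.syn1 rather than model.syn1neg.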