python, scikit-learn, nlp, data-visualization, word-embedding

t-SNE visualisation of a list of word vectors


I have a list of ~20k word vectors ('tuple_vectors'), with no labels; each one looks like the following:

[-2.84658718e+00 -7.74899840e-01 -2.24296474e+00 -8.69364500e-01
  3.90927410e+00 -2.65316987e+00 -9.71897244e-01 -2.40408254e+00
  1.16272974e+00 -2.61649752e+00 -2.87350488e+00 -1.06603658e+00
  2.93374014e+00  1.07194626e+00 -1.86619771e+00  1.88549474e-01
 -1.31901133e+00  3.83382154e+00 -3.46174908e+00 ...

Is there a quick, concise way to visualise them using t-SNE?

I've tried the following:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

n_sne = 21060  # number of word vectors

tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(tuple_vectors)
plt.scatter(tsne_results[:, 0], tsne_results[:, 1])
plt.show()

Solution

  • If you are vectorizing your text first, I suggest using the yellowbrick library. Since t-SNE is very expensive, TSNEVisualizer in yellowbrick applies a simpler decomposition ahead of time (SVD with 50 components by default) and then performs the t-SNE embedding. The visualizer then draws a scatter plot that can be coloured by cluster or by class. Here is a simple example using TfidfVectorizer:

    from yellowbrick.text import TSNEVisualizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # vectorize the text (sample_text is your list of raw documents)
    tfidf = TfidfVectorizer()
    tuple_vectors = tfidf.fit_transform(sample_text)
    
    # create the visualizer and draw the vectors
    tsne = TSNEVisualizer()
    tsne.fit(tuple_vectors)
    tsne.poof()
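
  • If, as in the question, you already have dense word vectors rather than raw text, you can reproduce the same pipeline by hand with scikit-learn and matplotlib: a cheap linear reduction first, then the 2-D t-SNE embedding, then a scatter plot. This is only a minimal sketch, assuming tuple_vectors can be stacked into a NumPy array of shape (n_words, dim) with dim > 50 (skip the PCA step if your vectors are smaller):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
    
    # stack the ~20k word vectors into a single (n_words, dim) array
    X = np.asarray(tuple_vectors)
    
    # cheap linear reduction first, analogous to what TSNEVisualizer does
    X_reduced = PCA(n_components=50).fit_transform(X)
    
    # 2-D t-SNE embedding of the reduced vectors
    X_2d = TSNE(n_components=2, perplexity=40, verbose=1).fit_transform(X_reduced)
    
    plt.scatter(X_2d[:, 0], X_2d[:, 1], s=2, alpha=0.5)
    plt.show()

    If you do have cluster or class labels, you can pass them to plt.scatter via the c= argument to colour the points, similar to what TSNEVisualizer does.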