python, scikit-learn, nlp, data-visualization, word-embedding

t-SNE visualisation of a list of word vectors


I have a list of ~20k word vectors ('tuple_vectors'), with no labels; each one looks like the following:

[-2.84658718e+00 -7.74899840e-01 -2.24296474e+00 -8.69364500e-01
  3.90927410e+00 -2.65316987e+00 -9.71897244e-01 -2.40408254e+00
  1.16272974e+00 -2.61649752e+00 -2.87350488e+00 -1.06603658e+00
  2.93374014e+00  1.07194626e+00 -1.86619771e+00  1.88549474e-01
 -1.31901133e+00  3.83382154e+00 -3.46174908e+00 ...

Is there a quick, concise way to visualise them using t-SNE?

I've tried the following:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

n_sne = 21060  # number of word vectors

tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(tuple_vectors)
plt.scatter(tsne_results[:, 0], tsne_results[:, 1])
plt.show()

Solution

  • If you are vectorizing your text first, I suggest using the yellowbrick library. Since t-SNE is very expensive, TSNEVisualizer in yellowbrick applies a simpler decomposition ahead of time (SVD with 50 components by default) and then performs the t-SNE embedding. The visualizer then draws a scatter plot that can be coloured by cluster or by class. Here is a simple example using TfidfVectorizer:

    from yellowbrick.text import TSNEVisualizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # vectorize the text (sample_text is your list of raw documents)
    tfidf = TfidfVectorizer()
    tuple_vectors = tfidf.fit_transform(sample_text)
    
    # create the visualizer and draw the vectors
    tsne = TSNEVisualizer()
    tsne.fit(tuple_vectors)
    tsne.poof()
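
  • If, as in the question, you already have dense word vectors rather than raw text, you can reproduce the same pipeline by hand with scikit-learn and matplotlib: a cheap linear reduction first, then the 2-D t-SNE embedding, then a scatter plot. This is only a minimal sketch, assuming tuple_vectors can be stacked into a NumPy array of shape (n_words, dim) with dim > 50 (skip the PCA step if your vectors are smaller):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
    
    # stack the ~20k word vectors into a single (n_words, dim) array
    X = np.asarray(tuple_vectors)
    
    # cheap linear reduction first, analogous to what TSNEVisualizer does
    X_reduced = PCA(n_components=50).fit_transform(X)
    
    # 2-D t-SNE embedding of the reduced vectors
    X_2d = TSNE(n_components=2, perplexity=40, verbose=1).fit_transform(X_reduced)
    
    plt.scatter(X_2d[:, 0], X_2d[:, 1], s=2, alpha=0.5)
    plt.show()

    If you do have cluster or class labels, you can pass them to plt.scatter via the c= argument to colour the points, similar to what TSNEVisualizer does.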