How can we check whether t-SNE results are real when we cluster the data?

I am applying TSNE for dimensionality reduction. I have several features that I reduce to 2. Afterwards, I use KMeans to cluster the data. Finally, I use seaborn to plot the clustering results.

To import TSNE I use:

from sklearn.manifold import TSNE

To apply TSNE I use:

features_tsne_32 = TSNE(n_components=2).fit_transform(standarized_data)

After that I use Kmeans:

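from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=6, **kmeans_kwargs)
# predict() needs a fitted model, so fit and assign labels in one step
km_tsne_32 = kmeans.fit_predict(features_tsne_32)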

Finally, I create the plot with:

import seaborn as sns

# plot the embedding with seaborn, colored by cluster label
facet = sns.lmplot(data=df, x='km_tsne_32_c1', y='km_tsne_32_c2', hue='km_tsne_32',
                   fit_reg=False, legend=True, legend_out=True)

This is the plot I get:

[image: scatter plot of the 2-D t-SNE embedding, colored by k-means cluster]

This plot seems too perfect and globular. Is something wrong with the procedure I follow in the code described above?


  • Your problem is not specific to t-SNE; it applies to any unsupervised learning algorithm: how do you evaluate its results?

    I would say that the only proper way to do this is to use some prior or expert knowledge about the data: labels, other metadata, or even user feedback.
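If labels are available, one possible check is to compare the cluster assignments against them. The sketch below is only an illustration: the data is synthetic (`make_blobs` with hand-picked centers), and the adjusted Rand index is one choice among several agreement scores, not something taken from the question.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for real data (an assumption): three well-separated groups
# whose true group membership is known.
X, true_labels = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Cluster without looking at the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
pred_labels = kmeans.fit_predict(X)

# The adjusted Rand index ignores how the clusters are numbered:
# 1.0 means perfect agreement, values near 0 mean chance-level agreement.
ari = adjusted_rand_score(true_labels, pred_labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

With real data, `true_labels` would be whatever prior knowledge you have, and a low score would suggest that the clusters do not reflect it.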

    That being said, regarding your specific plot:

    1. The fact that you get a continuous "pie" from t-SNE, rather than some discrete structure like "islands" or "spaghetti", likely indicates that the projection was not learned very well. t-SNE usually produces semi-distinct groups of similar data points. This shape looks like the output of an over-regularized model (like a VAE with a high KL-divergence coefficient).
    2. k-Means produces exactly the partitioning one would expect: its cluster assignment implicitly creates a Voronoi diagram over the feature space, with the cluster centroids as the cell sites. A good initialization produces initial centroids that are spread out over the feature space. Since that space is roughly symmetrical here, the centroids will probably be as well.
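To illustrate the Voronoi point, here is a small sketch on made-up data: points drawn uniformly from a disk, standing in for a structureless "pie". It checks that the k-means assignment matches a plain nearest-centroid rule, which is what defines the Voronoi cells. The data and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Points spread uniformly over a unit disk (sqrt gives uniform area density).
n_points = 2000
radius = np.sqrt(rng.uniform(0.0, 1.0, size=n_points))
angle = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
points = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(points)

# Voronoi rule: each point belongs to the cell of its nearest centroid.
diffs = points[:, None, :] - kmeans.cluster_centers_[None, :, :]
nearest = np.linalg.norm(diffs, axis=2).argmin(axis=1)

# Fraction of points where k-means and the nearest-centroid rule agree.
agreement = float(np.mean(nearest == kmeans.predict(points)))
print(f"Agreement with nearest-centroid rule: {agreement:.4f}")
```

Even though this data has no cluster structure at all, k-means still returns six tidy regions, so a neat partition by itself says little about the data.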

    So k-Means is fine, but you probably need to tweak the parameters of t-SNE.
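A minimal sketch of what tweaking t-SNE could look like: rerun it over a few perplexity values and compare the embeddings. The data here is synthetic and the perplexity values are arbitrary starting points, not recommendations for your dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized feature matrix (an assumption).
X, _ = make_blobs(n_samples=200, n_features=10, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)

embeddings = {}
for perplexity in (5, 30, 50):
    # perplexity must be smaller than the number of samples.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    # Final value of the t-SNE objective, printed for reference only.
    print(f"perplexity={perplexity}: KL divergence = {tsne.kl_divergence_:.3f}")
```

Plotting each entry of `embeddings` side by side shows whether the group structure is stable across settings. Structure that appears for only one perplexity value deserves less trust.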