python machine-learning scikit-learn cluster-analysis hierarchical-clustering

Clustering algorithm performance check on unplottable data


I am using the KMeans clustering algorithm from the scikit-learn library. My data has 169 dimensions, so I am unable to visualize the result of the clustering.

Is there any way to measure the performance of the algorithm?

Secondly, I have labels for the data and I want to test the learned model on the test dataset, but I am not sure that the labels KMeans assigned to the clusters coincide with the labels I have.


Solution

  • There are ways of visualizing high-dimensional data. You can sample some dimensions, or use PCA components, MDS, t-SNE, parallel coordinates, and many more.

    If you even just read the Wikipedia article on clustering, there is a section on evaluation, including supervised as well as unsupervised evaluation. But the results of such evaluation can be very misleading...

    Bear in mind that if you have labeled data, supervised methods should always outperform unsupervised methods, which do not have the labels: they don't know what to look for, and there is little reason to believe that every clustering happens to align with some labels. In particular, on most data there will be many reasonable clusterings that capture different aspects of your data.
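To make the points above concrete, here is a minimal sketch of one way to evaluate KMeans without plotting all 169 dimensions. It is not the only approach, and everything about the data is an assumption: `make_blobs` stands in for the real dataset, and 3 clusters is an arbitrary choice. It computes the silhouette score (unsupervised), the adjusted Rand index (supervised, and unaffected by how the clusters happen to be numbered), a cluster-to-label mapping found with the Hungarian algorithm, and a 2-D PCA projection you could pass to a scatter plot.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, confusion_matrix, silhouette_score

# Assumption: synthetic 169-dimensional data standing in for the real dataset.
X, y_true = make_blobs(n_samples=300, n_features=169, centers=3, random_state=0)

# Assumption: 3 clusters, chosen to match the synthetic data.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Unsupervised (internal) evaluation: uses only the data and the cluster ids.
sil = silhouette_score(X, cluster_ids)

# Supervised (external) evaluation: compares the partition with the known
# labels; the score does not depend on how the clusters are numbered.
ari = adjusted_rand_score(y_true, cluster_ids)

# Cluster ids are arbitrary, so match each cluster to the label it overlaps
# with most. Rows of the confusion matrix are true labels, columns are
# cluster ids; negating it turns the assignment into a maximisation.
cm = confusion_matrix(y_true, cluster_ids)
label_idx, cluster_idx = linear_sum_assignment(-cm)
cluster_to_label = {int(c): int(l) for l, c in zip(label_idx, cluster_idx)}

mapped = np.array([cluster_to_label[int(c)] for c in cluster_ids])
accuracy = float((mapped == y_true).mean())

# 2-D projection of the data, suitable for a scatter plot colored by cluster.
X_2d = PCA(n_components=2).fit_transform(X)

print(f"silhouette={sil:.3f}  ARI={ari:.3f}  mapped accuracy={accuracy:.3f}")
```

The same `cluster_to_label` mapping can be applied to `kmeans.predict(X_test)` on held-out data. Keep in mind that a high score on synthetic blobs says nothing about real data, where the clusters may not line up with the labels at all.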