To find the cluster an unseen sample belongs to, k-means stores a centroid for each cluster: the cluster whose centroid is closest to the new sample is simply the cluster it is assigned to.
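For reference, here is a minimal sketch of that nearest-centroid rule (scikit-learn on toy data; the variable names are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)          # toy training data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

x_new = np.array([[0.5, 0.5]])      # an unseen sample

# Either use the built-in predict (nearest-centroid rule) ...
label_a = km.predict(x_new)

# ... or compute it by hand: index of the closest centroid.
dists = np.linalg.norm(km.cluster_centers_ - x_new, axis=1)
label_b = np.argmin(dists)
```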
But what about hierarchical clustering? How would you find the cluster a new sample belongs to?
Similarly, in the case of co-clustering, after clustering we only get cluster ids for the rows and columns (separately) of the training data.
In other words, given a new sample with m features (columns), we need to somehow find the cluster that each feature belongs to. Can anyone explain how this is achieved in practice? If my assumption is incorrect, can you point me in the right direction?
Thanks
You don't.
It's not the purpose of clustering to label new data points. K-means is somewhat of an exception because the rule to use is obvious (nearest center), but even for k-means, labeling the point this way will not necessarily give the same result as running kmeans(X ∪ {x}) on the old data plus the new point. So it is not consistent.
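A rough sketch of what is meant (scikit-learn on toy data; the exact outcome depends on the data and the seed): assigning the new point by the nearest-center rule is not the same as refitting, because the partition of the old points themselves may change once the new point is included in the fit.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.random((100, 2))                   # original toy data
x_new = np.array([[0.5, 0.5]])             # the new point

km_old = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
label_by_rule = km_old.predict(x_new)      # nearest-center assignment

km_new = KMeans(n_clusters=3, n_init=10, random_state=0).fit(np.vstack([X, x_new]))

# Compare how the *original* points are partitioned before and after refitting:
# a score below 1 means the clustering of the old data itself changed.
print(adjusted_rand_score(km_old.labels_, km_new.labels_[:-1]))
```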
For other algorithms such as hierarchical clustering this effect is worse. A single new data point could cause two clusters to merge, for example!
What you can do - and what seems to be the common solution - is to use the clustering output to train a classifier. This classifier can then be used to predict cluster labels for new samples. A slow but common choice is the 1-nearest-neighbor classifier.
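As a sketch of that "cluster, then classify" approach (scikit-learn on toy data; the names and parameters are just for illustration): fit a clustering that has no predict step of its own, then train a 1-NN classifier on the resulting labels and use it to assign new samples.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(200, 5)                       # toy training data
cluster_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)

# Train a 1-nearest-neighbor classifier on the cluster labels.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, cluster_labels)

x_new = np.random.rand(1, 5)                     # an unseen sample
predicted_cluster = clf.predict(x_new)           # cluster id for the new sample
```

Any classifier could stand in for the 1-NN here; the point is only that the cluster labels become training targets.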