r hierarchical-clustering

How do I know which cluster number corresponds to which known class?


This may be a dumb question, but I can't find anything on the subject.

I have 3 classes (varieties) in my plant data and I performed a cluster analysis. I obtained the following table when comparing the clusters to the known classes:

cut.complete <- cutree(cluster.complete,k=3)
cc <- table(variety,cut.complete) 
cc
         cut.complete
variety    1  2  3
  AK      46 13  0
  AF       2 18 50
  GH       0 26 21

How do I know that cluster 2 is the cluster that corresponds to the known AF class? For example, could cluster 3 correspond to the AF class instead?

If clusters 1, 2 and 3 do not correspond to the true varieties AK, AF and GH respectively, then I cannot use the formula

100*round(sum(diag(cc))/sum(cc), digits=3)

to calculate the percentage of correctly classified samples.

Thank you.


Solution

  • In this case, cluster label 3 matches the ground-truth variety AF (50 samples) better than it matches GH (21 samples), and cluster label 2 matches GH (26 samples) better than it matches AF (18 samples). In other words, pair each cluster label with the ground-truth class it shares the most samples with, rather than assuming that cluster i corresponds to the i-th variety.

    The following example shows the same idea: each actual (ground-truth) class label is matched with the cluster label that holds the largest number of data points in its row. Here cluster 3 is matched with class label AK because the largest count in the AK row (130) falls under cluster label 3.

    tab
           cut.complete
    variety   1   2   3
         AF 110 125  82
         AK  93 102 130
         GH 129 103 126
    
    library(e1071)
    matchClasses(tab) # find which cluster labels match with which class labels
    
    Cases in matched pairs: 38.4 %
    AF AK GH 
     2  3  1
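The same row-maximum rule can be sketched in base R on the table from the question, without e1071. This is only an illustration: the names `matched`, `cc.matched` and `pct` are made up here, and the table is typed in by hand from the counts shown in the question. It assumes every variety ends up with a different cluster; if two varieties claim the same cluster, the diagonal formula no longer applies, and the `"greedy"` or `"exact"` methods of `matchClasses` may be a better fit.

```r
# Contingency table from the question (rows: known varieties, columns: clusters)
cc <- matrix(c(46, 13,  0,
                2, 18, 50,
                0, 26, 21),
             nrow = 3, byrow = TRUE,
             dimnames = list(variety = c("AK", "AF", "GH"),
                             cut.complete = c("1", "2", "3")))

# Row-maximum rule: for each variety, pick the cluster holding most of its samples
matched <- apply(cc, 1, which.max)
matched

# Assumption: the rule is only a valid relabelling if no cluster is claimed twice
stopifnot(!anyDuplicated(matched))

# Reorder the columns so matched pairs sit on the diagonal, then reuse the formula
cc.matched <- cc[, matched]
pct <- 100 * round(sum(diag(cc.matched)) / sum(cc.matched), digits = 3)
pct
```

For the table in the question this pairs AK with cluster 1, AF with cluster 3 and GH with cluster 2, and 69.3 % of the samples fall in matched pairs.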