This may be a dumb question, but I can't find anything on the subject.
I have 3 known classes (varieties) in my plant data, and I performed a cluster analysis. When I compare the clusters to the known classes, I get the following table:
cut.complete <- cutree(cluster.complete,k=3)
cc <- table(variety,cut.complete)
cc
       cut.complete
variety  1  2  3
     AK 46 13  0
     AF  2 18 50
     GH  0 26 21
How do I know that cluster 2 is the cluster that corresponds to the known AF class? For example, could cluster 3 correspond to the AF class instead?
If clusters 1, 2 and 3 do not correspond to the true varieties AK, AF and GH respectively, it means I cannot use the formula
100*round(sum(diag(cc))/sum(cc), digits=3)
to calculate the percentage of correctly classified samples.
Thank you.
In this case, cluster label 3 matches the ground-truth variety AF (50 samples) better than it matches GH (21), and cluster label 2 matches GH (26) better than it matches AF (18). Cluster label 1 clearly matches AK (46). The general rule is to match each cluster label with the ground-truth class it shares the most samples with. Cluster numbers are arbitrary, so the columns have to be put in the matched order before sum(diag(cc)) means anything.
The matchClasses function in the e1071 package does this matching. In the following example, each actual (ground-truth) class label is matched with the cluster label that has the largest count in its row: class label AK is matched with cluster 3 because the largest count in the AK row (130) is in column 3.
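As a rough illustration, the maximum-match rule can be sketched in base R on the table from the question. The table is rebuilt by hand here, and the names best_row, cluster_to_variety, matched and pct_matched are made up for this sketch; they are not part of any package.

```r
# Contingency table from the question
# (rows = known varieties, columns = cluster labels from cutree)
cc <- matrix(c(46, 13,  0,
                2, 18, 50,
                0, 26, 21),
             nrow = 3, byrow = TRUE,
             dimnames = list(variety = c("AK", "AF", "GH"),
                             cut.complete = c("1", "2", "3")))

# For each cluster (column), find the variety (row) with the largest count
best_row <- apply(cc, 2, which.max)
cluster_to_variety <- rownames(cc)[best_row]
names(cluster_to_variety) <- colnames(cc)
cluster_to_variety

# Percentage of samples that fall in the majority variety of their cluster
matched <- sum(apply(cc, 2, max))
pct_matched <- 100 * round(matched / sum(cc), digits = 3)
pct_matched
```

On this table the mapping comes out as cluster 1 to AK, cluster 2 to GH and cluster 3 to AF, with 122 of 176 samples (about 69.3 %) in matched cells. Note that a column-wise maximum does not guarantee that every variety gets its own cluster; it just happens to here.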
tab
       cut.complete
variety   1   2   3
     AF 110 125  82
     AK  93 102 130
     GH 129 103 126
library(e1071)
matchClasses(tab) # find which cluster labels match with which class labels
Cases in matched pairs: 38.4 %
AF AK GH
 2  3  1
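A row-wise maximum can, in general, assign two classes to the same cluster, and the sum(diag(...)) formula from the question only makes sense with a one-to-one matching. With only three clusters, one way to get that (a sketch, not what matchClasses does internally) is to try all six column orderings and keep the one with the largest diagonal. The names perms, diag_sums, best and tab_matched are hypothetical. As far as I know, matchClasses also accepts method = "exact" for a one-to-one matching on square tables, but the code below uses base R only.

```r
# Example table from the answer
# (rows = varieties, columns = cluster labels)
tab <- matrix(c(110, 125,  82,
                 93, 102, 130,
                129, 103, 126),
              nrow = 3, byrow = TRUE,
              dimnames = list(variety = c("AF", "AK", "GH"),
                              cut.complete = c("1", "2", "3")))

# All 6 possible orderings of the three cluster labels
perms <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
              c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

# Diagonal sum of the table after reordering its columns by each ordering
diag_sums <- sapply(perms, function(p) sum(diag(tab[, p])))
best <- perms[[which.max(diag_sums)]]
best

# Reordered table: column i now holds the cluster matched to variety i,
# so the diagonal formula from the question applies
tab_matched <- tab[, best]
pct <- 100 * round(sum(diag(tab_matched)) / sum(tab_matched), digits = 3)
pct
```

Here the best ordering is 2, 3, 1 (AF with cluster 2, AK with cluster 3, GH with cluster 1), and the formula gives 38.4, which agrees with the "Cases in matched pairs" figure above. Brute force is fine for a handful of clusters but grows factorially with the number of clusters.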