How can I measure accuracy for hierarchical clustering (single linkage) in R with 2 clusters? Here is my code:
> dcdata = read.csv("kkk.txt")
> target = dcdata[,3]
> dcdata = dcdata [,1:2]
> d = dist(dcdata)
> hc_single = hclust(d,method="single")
> plot(hc_single)
> clusters =cutree(hc_single, k=2)
> print(clusters)
Thanks!
"Accuracy" is not quite the right term for clustering, since cluster numbers are arbitrary, but I guess you want to see whether the hierarchical clustering gives you groups that coincide with your labels. As an example, I use the iris dataset with setosa vs. others as the target:
data = iris
target = ifelse(data$Species=="setosa","setosa","others")
table(target)
others setosa
   100     50
data = data[,1:4]
d = dist(data)
hc_single = hclust(d,method="single")
plot(hc_single)
It looks like there are two major clusters. Now let's see how the target labels are distributed across the dendrogram:
library(dendextend)
dend <- as.dendrogram(hc_single)
COLS = c("turquoise","orange")
names(COLS) = unique(target)
dend <- color_labels(dend, col = COLS[target[labels(dend)]])
plot(dend)
Now, as you did, we cut the tree into two clusters and cross-tabulate them against the target:
clusters = cutree(hc_single, k=2)
table(clusters,target)
        target
clusters others setosa
       1      0     50
       2    100      0
You get a perfect separation here. All the data points in cluster 1 are setosa and all in cluster 2 are not setosa. So you can think of it as 100% accuracy, but I would be careful about using that term.
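To read the cross-tabulation as proportions rather than counts, one option is base R's `prop.table` with `margin = 1`, so that each row (cluster) sums to 1. This is only a sketch on the iris example; the variable name `props` is my own and not part of the original code.

```r
# Rebuild the iris example so this block runs on its own
target <- ifelse(iris$Species == "setosa", "setosa", "others")
hc_single <- hclust(dist(iris[, 1:4]), method = "single")
clusters <- cutree(hc_single, k = 2)

# Row-wise proportions: the share of each label within each cluster
props <- prop.table(table(clusters, target), margin = 1)
print(props)
```

A row close to 1 for a single label means the cluster is dominated by that label; rows with mixed proportions mean the cluster straddles several labels.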
You can roughly calculate the agreement like this (using which.max so it also works with more than two classes):
Majority_class = tapply(factor(target), clusters, function(i) names(which.max(table(i))))
This tells you the majority class of each cluster. From there we can see how often this agrees with the actual labels:
mean(Majority_class[clusters] == target)
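The majority-class calculation above can be wrapped in a small reusable function. This is a minimal sketch using base R only. The name `cluster_purity` is hypothetical (my own choice, not a function from any package), and the measure it returns is often called purity rather than accuracy.

```r
# Hypothetical helper: fraction of points whose label equals the
# majority label of the cluster they were assigned to.
cluster_purity <- function(clusters, target) {
  tab <- table(clusters, target)        # rows: clusters, columns: labels
  sum(apply(tab, 1, max)) / sum(tab)    # sum of per-cluster majority counts
}

# Same iris example as above, rebuilt so the block is self-contained
target <- ifelse(iris$Species == "setosa", "setosa", "others")
hc_single <- hclust(dist(iris[, 1:4]), method = "single")
clusters <- cutree(hc_single, k = 2)

purity <- cluster_purity(clusters, target)
print(purity)
```

On your own data you would pass the result of `cutree` and your label column in the same way. Keep in mind that this number can only go up as you increase the number of clusters, so it is best compared at a fixed k.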