Measure Accuracy in Hierarchical Clustering (Single link) in R


How can I measure accuracy in hierarchical clustering (single linkage) in R with 2 clusters? Here is my code:

dcdata = read.csv("kkk.txt")
target = dcdata[,3]
dcdata = dcdata[,1:2]
d = dist(dcdata)
hc_single = hclust(d,method="single")
plot(hc_single)
clusters = cutree(hc_single, k=2)
print(clusters)

Thanks!


Solution

  • "Accuracy" is not quite the right term for clustering, since the algorithm never sees the labels, but I assume you want to check whether the clusters from hierarchical clustering coincide with your labels. As an example, I use the iris dataset with setosa vs. others as the target:

    data = iris
    target = ifelse(data$Species=="setosa","setosa","others")
    table(target)
    # others setosa
    #    100     50
    
    data = data[,1:4]
    d = dist(data)
    hc_single = hclust(d,method="single")
    plot(hc_single)
    

    [Figure: dendrogram of the single-linkage clustering]

    There appear to be two major clusters. Next, we look at how the target labels are distributed across the dendrogram:

    library(dendextend)
    dend <- as.dendrogram(hc_single)
    COLS = c("turquoise","orange")
    names(COLS) = unique(target)
    dend <- color_labels(dend, col = COLS[target[labels(dend)]])
    plot(dend) 
    

    [Figure: the same dendrogram with leaf labels colored by target]

    Now, as in your code, we cut the tree into two clusters and cross-tabulate them against the target:

    clusters =cutree(hc_single, k=2)
    table(clusters,target)
    
                target
        clusters others setosa
               1      0     50
               2    100      0
    

    You get a perfect separation: all the data points in cluster 1 are setosa and all those in cluster 2 are not. You can think of this as 100% accuracy, but I would be careful about using the term.

    You can compute the agreement roughly like this:

    # most frequent label within each cluster (works for any number of classes)
    Majority_class = tapply(factor(target),clusters,function(i)names(which.max(table(i))))
    

    This gives the majority class for each cluster. From there, we check how often it agrees with the actual labels:

    mean(Majority_class[clusters] == target)
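
    The two steps above can be wrapped into one small function. This is only a sketch: the name majority_agreement is made up for illustration and is not part of base R or dendextend. It builds the same contingency table as before and looks up the majority label by cluster name rather than by position.

    ```r
    # Hypothetical helper (not from any package): share of points whose label
    # matches the majority label of the cluster they were assigned to.
    majority_agreement <- function(clusters, target) {
      tab <- table(clusters, target)  # rows = clusters, columns = labels
      # Majority label per cluster; which.max handles any number of classes
      majority <- colnames(tab)[apply(tab, 1, which.max)]
      names(majority) <- rownames(tab)
      # Index by cluster name so non-consecutive cluster ids also work
      mean(majority[as.character(clusters)] == as.character(target))
    }

    # Same iris setup as in the answer
    target <- ifelse(iris$Species == "setosa", "setosa", "others")
    hc_single <- hclust(dist(iris[, 1:4]), method = "single")
    clusters <- cutree(hc_single, k = 2)
    majority_agreement(clusters, target)
    ```

    With your own data, the call would be majority_agreement(clusters, target), using the clusters and target objects from your code. Keep in mind that this number can only go up as you cut the tree into more clusters (with one cluster per point it is trivially 1), which is another reason not to read it as accuracy in the usual sense.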