
Confusion matrix using table in k-means and hierarchical clustering


I am having trouble calculating a confusion matrix. I have created three sets of points from multivariate normal distributions:

library('MASS')
library('ggplot2')
library('reshape2')
library("ClusterR")
library("cluster")
library("dplyr")
library ("factoextra")
library("dendextend")
library("circlize")

mu1<-c(1,1)
mu2<-c(1,-9)
mu3<-c(-7,-2)

sigma1<-matrix(c(1,1,1,2), nrow=2, ncol=2, byrow = TRUE)
sigma2<-matrix(c(1,-1,-1,2), nrow=2, ncol=2, byrow = TRUE)
sigma3<-matrix(c(2,0.5,0.5,0.3), nrow=2, ncol=2, byrow = TRUE)

simulation1<-mvrnorm(100,mu1,sigma1)
simulation2<-mvrnorm(100,mu2,sigma2)
simulation3<-mvrnorm(100,mu3,sigma3)

X<-rbind(simulation1,simulation2,simulation3)
colnames(X)<-c("x","y")
X<-data.frame(X)

I have also clustered the points using k-means clustering and hierarchical clustering, with k = 3 clusters in both cases:

    # k-means clustering
    k<-3
    B<-kmeans(X, centers = k, nstart = 10)
    x_cluster = data.frame(X, group=factor(B$cluster))
    ggplot(x_cluster, aes(x, y, color = group)) + geom_point()

    # hierarchical clustering
    single<-hclust(dist(X), method = "single")
    clusters2<-cutree(single, k = 3)
    fviz_cluster(list (data = X, cluster=clusters2))

How can I calculate a confusion matrix for the full dataset (X) using table() in both of these cases?


Solution

  • Using your data, insert set.seed(42) just before you create sigma1 so that we have a reproducible example. Then, after you have created X:

    X.df <- data.frame(Grp=rep(1:3, each=100), x=X[, 1], y=X[, 2])
    k <- 3
    B <- kmeans(X, centers = k, nstart = 10)
    table(X.df$Grp, B$cluster)
    # 
    #       1   2   3
    #   1   1   0  99
    #   2   0 100   0
    #   3 100   0   0
    

    Original group 1 is identified as cluster 3, with one specimen assigned to cluster 1. Original group 2 is assigned to cluster 2 and original group 3 is assigned to cluster 1. The cluster numbers themselves are arbitrary. The classification is perfect if each row and each column has all of its values in a single cell. In this case only 1 specimen was misplaced.
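
    Since the cluster numbers are arbitrary, it can be convenient to relabel them so that the matches fall on the diagonal. The following is only a minimal sketch, assuming each cluster is dominated by a different original group; the toy vectors truth and cluster are hypothetical stand-ins for X.df$Grp and B$cluster:

    ```r
    # Hypothetical toy labels standing in for X.df$Grp and B$cluster
    truth   <- rep(1:3, each = 4)
    cluster <- c(3, 3, 3, 1,  2, 2, 2, 2,  1, 1, 1, 1)

    tab <- table(truth, cluster)

    # For each cluster (column), find the original group (row) it overlaps most
    map <- apply(tab, 2, which.max)

    # Translate cluster numbers into group numbers and tabulate again
    relabelled <- unname(map[as.character(cluster)])
    aligned <- table(truth, relabelled)
    aligned
    ```

    If two clusters happen to map to the same original group, this simple majority rule no longer gives a one-to-one relabelling, so the mapping in map should be checked before the aligned table is interpreted.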

    single <- hclust(dist(X), method = "single")
    clusters2 <- cutree(single, k = 3)
    table(X.df$Grp, clusters2)
    #    clusters2
    #       1   2   3
    #   1  99   1   0
    #   2   0   0 100
    #   3   0 100   0
    

    The results are the same, but the cluster numbers are different. One specimen from the original group 1 was assigned to the same group as the group 3 specimens. To compare these results:

    table(Kmeans=B$cluster, Hierarch=clusters2)
    #       Hierarch
    # Kmeans   1   2   3
    #      1   0 101   0
    #      2   0   0 100
    #      3  99   0   0
    

    Notice that each row/column contains only one cell that is nonzero. The two cluster analyses agree with one another even though the cluster designations differ.
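
    The "one nonzero cell per row and column" check can also be done programmatically. This is a small sketch under the assumption that both label vectors have the same length; the vectors a and b are hypothetical toy labels standing in for B$cluster and clusters2:

    ```r
    # Hypothetical toy labels standing in for B$cluster and clusters2
    a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
    b <- c(3, 3, 3, 1, 1, 1, 2, 2, 2)

    tab <- table(Kmeans = a, Hierarch = b)

    # Count the nonzero cells in every row and every column
    one_per_row <- all(rowSums(tab > 0) == 1)
    one_per_col <- all(colSums(tab > 0) == 1)

    # TRUE when the two partitions are identical up to renaming of the clusters
    agree <- one_per_row && one_per_col
    agree
    ```

    The check is deliberately strict: a single point placed differently by the two methods makes agree FALSE, so it answers "are the partitions identical?" rather than "how similar are they?".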

    D <- lda(Grp~x + y, X.df)
    table(X.df$Grp, predict(D)$class)
    #    
    #       1   2   3
    #   1  99   0   1
    #   2   0 100   0
    #   3   0   0 100
    

    Linear discriminant analysis tries to predict the group of each specimen given the values of x and y. Because it is fitted to the known group labels, the predicted group numbers are not arbitrary and the correct predictions all fall on the diagonal of the table. This is what is usually described as a confusion matrix.
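
    When the correct predictions sit on the diagonal, summary figures can be read straight off the table. As a rough illustration, the counts below are copied from the LDA table above; the names conf, accuracy and per_group are only illustrative:

    ```r
    # Counts copied from the LDA table (rows = original group, columns = predicted)
    conf <- matrix(c(99,   0,   1,
                      0, 100,   0,
                      0,   0, 100),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(truth = c("1", "2", "3"),
                                   predicted = c("1", "2", "3")))

    # Overall accuracy: correct predictions (diagonal) over all specimens
    accuracy <- sum(diag(conf)) / sum(conf)   # 299 / 300

    # Share of each original group that was predicted correctly
    per_group <- diag(conf) / rowSums(conf)
    ```

    With your own objects the same two lines apply to table(X.df$Grp, predict(D)$class). They are only meaningful for the clustering tables after the cluster numbers have been matched to the original groups.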