Need help interpreting clusplot() graph for kmeans data analysis


I'm a noob to R and trying to understand the output of clusplot() after running kmeans() on a data frame. I'm probably missing something fundamental, because I can't fully interpret the following plot:

[Image: CLUSPLOT(newseeds); x-axis is Component 1, y-axis is Component 2]

The code for this particular plot is:

clusplot(newseeds, kc$cluster, color=TRUE, shade=TRUE, labels=2, lines=0) 

The code for everything up to this point is:

library("cluster", lib.loc="~/R/win-library/3.3")

#load the data
seeds <- read.csv("E:/Datasets/seeds.csv")
View(seeds)
table(seeds$variety.of.wheat)
#remove the variety.of.wheat
newseeds<-seeds
newseeds$variety.of.wheat<-NULL
head(newseeds)
#make sure that the result is reproducible
set.seed(1234)
#Run the method and store the result in kc variable
kc<-kmeans(newseeds, 3)
#output the result
print(kc)
kc$centers
kc$totss

#cluster to class evaluation
table(seeds$variety.of.wheat, kc$cluster)
#cluster plot
clusplot(newseeds, kc$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

My questions:

  • What are Component 1 and Component 2?
  • I know the different bubbles are clusters, but why do they overlap?

Solution

  • The "Components" are principal components. Briefly, principal component analysis is a standard method for retaining as much of the variability in a set of multivariate data as possible while using fewer dimensions than the original data contain. Here is an example using the iris data set that ships with R:

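    data(iris)
    library(cluster)
    set.seed(1234)  # kmeans() starts from random centers; fix the seed for a reproducible plot
    # drop column 5 (Species) and cluster on the four numeric measurements
    iris.clus <- kmeans(iris[, -5], 3)
    clusplot(iris[, -5], iris.clus$cluster)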
    

    Here is the result:

    [Image: clusplot of the iris data]

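    The iris data consist of four measurements on each flower, taken from three species of iris, so the data are four-dimensional. The first two principal components summarize as much of that variability as possible in two dimensions (much as a shadow collapses a three-dimensional figure into two dimensions). Some detail is lost in the projection, which is why two of the clusters appear to overlap even though kmeans() assigns every point to exactly one cluster. Each ellipse outlines one cluster: by default clusplot() draws the smallest ellipse that contains all of the cluster's points, and with span = FALSE it bases the ellipse on the cluster's mean and covariance matrix instead.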
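
    To see where the two axes come from, it may help to compute the principal components by hand. The following is only a sketch: it uses base R's prcomp() on standardized variables, which I am assuming is close to what clusplot() does internally for data with more than two columns (the signs and scaling of the axes may differ). The names x, km, pca and var_explained are just illustrative.

```r
# Sketch: rebuild the two clusplot() axes with base R (no extra packages).
set.seed(1234)                 # kmeans() starts from random centers
x  <- iris[, -5]               # the four numeric measurements, Species dropped
km <- kmeans(x, centers = 3)

# PCA on centered, standardized variables
# (assumption: this mirrors the projection clusplot() uses)
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Share of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings: how strongly each original variable contributes to
# Component 1 and Component 2
print(round(pca$rotation[, 1:2], 2))

# Scores on the first two components, colored by k-means cluster
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "Component 1", ylab = "Component 2")
```

    The sum of the first two entries of var_explained should correspond to the "these two components explain ... % of the point variability" line that clusplot() prints under the plot. The loadings show that each component is a weighted mix of all four measurements, which is why the axes have no units of their own.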