I'm new to R and trying to understand the output of clusplot() after running kmeans() on a data frame. I'm probably missing some fundamental information, because I can't fully interpret the following plot, titled CLUSPLOT(newseeds), whose x-axis is "Component 1" and whose y-axis is "Component 2".
The code for this particular plot is:
clusplot(newseeds, kc$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
The code for everything up to this point is:
library("cluster", lib.loc="~/R/win-library/3.3")
#load the data
seeds <- read.csv("E:/Datasets/seeds.csv")
View(seeds)
table(seeds$variety.of.wheat)
#remove the variety.of.wheat
newseeds<-seeds
newseeds$variety.of.wheat<-NULL
head(newseeds)
#make sure that the result is reproducible
set.seed(1234)
#Run the method and store the result in kc variable
kc<-kmeans(newseeds, 3)
#output the result
print(kc)
kc$centers
kc$totss
#cluster to class evaluation
table(seeds$variety.of.wheat, kc$cluster)
#cluster plot
clusplot(newseeds, kc$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
My questions:
The "Components" are principal components. Briefly, principal component analysis is a standard method for retaining as much of the information in a multivariate data set as possible while using fewer dimensions than the original data contain. Here is an example using the iris data set that ships with R:
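data(iris)
library(cluster)
# fix the random starting centers so the clustering is reproducible
set.seed(1234)
# cluster on the four measurements, dropping the Species column
iris.clus <- kmeans(iris[, -5], 3)
clusplot(iris[, -5], iris.clus$cluster)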
Here is the result:
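The iris data consist of four measurements on flowers from three species of iris, so the data are four-dimensional. The first two principal components summarize as much of the variability as possible in two dimensions (much as a shadow collapses a three-dimensional figure into two dimensions). Some detail is lost in the projection, so two of the groups appear to overlap. The ellipses outline the clusters: with the default span = TRUE, each is the smallest ellipse that contains all the points of its cluster, while span = FALSE bases it on the cluster's mean and covariance matrix.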
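To see how much the two-dimensional view retains, the components can also be computed directly. The following is a minimal sketch that uses prcomp() from base R on the standardized measurements. It is meant only to illustrate the idea and is not claimed to reproduce the exact computation clusplot() performs internally; the variable names (pca, var_explained, scores) are chosen here for illustration.

```r
# Illustration only: prcomp() shows what principal components are;
# this is not claimed to match clusplot()'s internals exactly.
data(iris)

# PCA on the four measurements, standardized to unit variance
pca <- prcomp(iris[, -5], scale. = TRUE)

# Proportion of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Share of the variability retained by the first two components
print(sum(var_explained[1:2]))

# Scores on the first two components: coordinates for a 2-D view of the data
scores <- pca$x[, 1:2]

# Colour the points by k-means cluster, as in the example above
set.seed(1234)
iris.clus <- kmeans(iris[, -5], 3)
plot(scores, col = iris.clus$cluster,
     xlab = "Component 1", ylab = "Component 2")
```

The closer the printed share is to 1, the less information is lost by looking at only two components. The same steps should work on the newseeds data frame from the question, assuming all of its columns are numeric.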