I have a dataset and I want to apply K-means clustering to form groups, but considering only pairs of variables at a time.
The dataset has a class variable. I want to keep it out of the clustering and use it only to evaluate the algorithm's performance.
I want to do this automatically: every possible combination of two variables should be tried, and only the best one returned.
How can I do this in R? You can use the iris dataset as an example.
Welcome to SO! What about something like this? It keeps all the models and everything about them; to get only the best combination, look at the bottom of the answer:
# first, the pairwise combinations of columns, without the labels
comb <- combn(names(iris[,-5]),2,simplify=FALSE)
# an empty list to populate with kmeans
listed <- list()
Then a for loop applies kmeans to each subset and puts the output in the list:
for (i in seq_along(comb)) {
  names_ <- comb[[i]]           # the two column names of this pair
  df <- iris[, names_]          # keep only those two columns
  listed[[i]] <- kmeans(df, 3)  # 3 clusters; random start, so results can vary between runs
}
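Since kmeans() starts from random centers, the loop above can give slightly different clusters on each run. Here is a minimal sketch of the same idea made reproducible. The seed (42), nstart = 25 and the name models are arbitrary choices of mine, not something the question requires:

```r
# Sketch: pairwise k-means with a fixed seed and several random starts.
set.seed(42)  # arbitrary seed, only there to make the run reproducible

# all pairs of the four measurement columns (Species, column 5, is left out)
comb <- combn(names(iris)[-5], 2, simplify = FALSE)

# one k-means fit per pair; nstart = 25 keeps the best of 25 random starts
models <- lapply(comb, function(vars) {
  kmeans(iris[, vars], centers = 3, nstart = 25)
})

# label each fit with the pair of variables it used
names(models) <- sapply(comb, paste, collapse = " + ")
names(models)
```

With the names set, a model can be picked by its label, e.g. models[["Sepal.Length + Petal.Length"]], instead of by position.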
As an example, here is the second model:
listed[[2]]
K-means clustering with 3 clusters of sizes 51, 58, 41
Cluster means:
Sepal.Length Petal.Length
1 5.007843 1.492157
2 5.874138 4.393103
3 6.839024 5.678049
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2
[66] 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3
[131] 3 3 3 3 3 3 3 3 2 3 3 3 2 3 3 3 2 3 3 2
Within cluster sum of squares by cluster:
[1] 9.893725 23.508448 20.407805
(between_SS / total_SS = 90.5 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
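The components listed in the output can be pulled out of each model directly. As a rough sketch, this collects the between_SS / total_SS ratio for every pair of variables. Note that this ratio ignores the class labels, so it only tells you how compact the clusters are, not how well they match Species. The seed and nstart values are again arbitrary assumptions:

```r
# Sketch: compare the pairs by the between_SS / total_SS ratio of each fit.
set.seed(42)  # arbitrary seed for reproducibility

comb <- combn(names(iris)[-5], 2, simplify = FALSE)

ratio <- sapply(comb, function(vars) {
  km <- kmeans(iris[, vars], centers = 3, nstart = 25)
  km$betweenss / km$totss  # share of the total variance explained by the clusters
})

# one row per pair, ratio shown as a percentage
data.frame(pair  = sapply(comb, paste, collapse = " + "),
           ratio = round(100 * ratio, 1))
```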
If you only want the "best" model, in this case the one with the best purity index (note: I've never used it, so check the formula), here is another loop:
# combinations
comb <- combn(names(iris[,-5]),2,simplify=FALSE)
# another list
listed_1 <- list()
library(dplyr) # external package to make it simpler
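for (i in seq_along(comb)) {
  names_ <- comb[[i]]
  df <- iris[, names_]
  km <- kmeans(df, 3)
  # one row per observation: assigned cluster, true species, and a counter
  df <- data.frame(cl = km$cluster, spec = iris$Species, cnt = 1)
  # count the observations for each (cluster, species) pair
  df <- aggregate(df$cnt, list(cl = df$cl, spec = df$spec), sum)
  # for each species, keep the cluster that holds most of its observations
  df <- df %>% group_by(spec) %>% filter(x == max(x))
  # percentage of observations that fall in those clusters
  listed_1[[i]] <- round(sum(df$x) / nrow(iris), 2) * 100
}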
Now you have a list as the result. The following commands put together (cbind), in a single character matrix, the list of results (do.call(rbind, listed_1)) and the combinations (do.call(rbind, comb)):
res <- cbind(do.call(rbind, listed_1),do.call(rbind, comb))
res[which.max(res[,1]),]
[1] "95" "Petal.Length" "Petal.Width"
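The loop above takes, for each species, the cluster holding most of its observations. The usual textbook definition of purity goes the other way round: for each cluster, count its most frequent class, sum those counts, and divide by the number of observations. Here is a minimal sketch of that version in base R, so you can compare the two. The helper purity(), the seed and nstart are my own assumptions, and I have not checked that it gives the same winner on every run:

```r
# Sketch: textbook purity (most frequent class per cluster), base R only.
set.seed(42)  # arbitrary seed for reproducibility

# hypothetical helper: share of observations that belong to the
# majority class of their cluster
purity <- function(clusters, classes) {
  tab <- table(clusters, classes)         # rows = clusters, columns = classes
  sum(apply(tab, 1, max)) / length(clusters)
}

comb <- combn(names(iris)[-5], 2, simplify = FALSE)

scores <- sapply(comb, function(vars) {
  km <- kmeans(iris[, vars], centers = 3, nstart = 25)
  purity(km$cluster, iris$Species)
})

# a data.frame keeps the scores numeric instead of turning them into strings
res_df <- data.frame(var1   = sapply(comb, `[`, 1),
                     var2   = sapply(comb, `[`, 2),
                     purity = scores)

# the pair of variables with the highest purity
res_df[which.max(res_df$purity), ]
```

Using a data.frame instead of cbind also avoids converting the scores to character before calling which.max.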