
Pairwise K-Means in R


I have a dataset and want to group it with K-means clustering, but using only two variables at a time.

The dataset has a class variable. I want to leave it out of the clustering and use it only to evaluate the algorithm's performance.

I want this done automatically: every possible pair of variables should be tried and only the best one returned.

How can I do this in R? You can use the Iris dataset as an example.


Solution

  • Welcome to SO! What about something like this? It fits and keeps a model for every pair of variables (to get only the best combination, see the bottom of the answer):

    # first, the pairwise combinations of columns, without the labels
    comb <- combn(names(iris[,-5]),2,simplify=FALSE)
    # an empty list to populate with the kmeans results
    listed <- list()
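
    To see what the loop will iterate over, here is a small sketch that only inspects the combinations. It uses base R alone; `names(iris)[-5]` is just another way of dropping the Species column.

    ```r
    # Pairs of the four measurement columns (Species, column 5, is left out)
    comb <- combn(names(iris)[-5], 2, simplify = FALSE)
    length(comb)   # choose(4, 2) = 6 pairs
    comb[[1]]      # "Sepal.Length" "Sepal.Width"
    # One row per pair, handy for a quick look at all of them
    do.call(rbind, comb)
    ```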
    

    Then a for loop applies kmeans to each subset and stores the output in the list:

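    for (i in seq_along(comb)){
      names_ <- comb[[i]]
      # keep only the two columns of the current pair
      df <- iris[, names_]
      # 3 clusters; kmeans starts from random centers, so results can vary between runs
      listed[[i]] <- kmeans(df,3)
    }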
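
    The same idea can be sketched with lapply, which returns the list directly and makes it easy to name each model after its pair of variables. The seed, `nstart = 25`, and the name `models` are my own choices for illustration, not part of the code above.

    ```r
    set.seed(42)  # assumed seed, only so that repeated runs give the same clusters
    comb <- combn(names(iris)[-5], 2, simplify = FALSE)

    # One kmeans fit per pair; nstart = 25 tries several random starts and keeps the best
    models <- lapply(comb, function(vars) {
      kmeans(iris[, vars], centers = 3, nstart = 25)
    })

    # Name each model after its two variables, e.g. "Sepal.Length + Petal.Length"
    names(models) <- vapply(comb, paste, character(1), collapse = " + ")
    names(models)
    ```

    With names in place, a model can be retrieved by its pair, for example models[["Sepal.Length + Petal.Length"]], instead of by position.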
    

    For example, here is the second model (Sepal.Length and Petal.Length):

    listed[[2]]
    K-means clustering with 3 clusters of sizes 51, 58, 41
    
    Cluster means:
      Sepal.Length Petal.Length
    1     5.007843     1.492157
    2     5.874138     4.393103
    3     6.839024     5.678049
    
    Clustering vector:
      [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2
     [66] 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3
    [131] 3 3 3 3 3 3 3 3 2 3 3 3 2 3 3 3 2 3 3 2
    
    Within cluster sum of squares by cluster:
    [1]  9.893725 23.508448 20.407805
     (between_SS / total_SS =  90.5 %)
    
    Available components:
    
    [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"        
    [9] "ifault"
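
    Since the class variable was kept out of the clustering, a quick way to compare a model against it is a cross-table of cluster labels and species. This is only a sketch: the seed is an assumption, and cluster numbers are arbitrary, so the rows may come out in a different order than in the printout above.

    ```r
    set.seed(42)  # assumed seed for repeatability
    km <- kmeans(iris[, c("Sepal.Length", "Petal.Length")], centers = 3)

    # Rows are clusters, columns are the true species
    tab <- table(cluster = km$cluster, species = iris$Species)
    tab

    # Components listed under "Available components" can be read with $
    km$tot.withinss            # total within-cluster sum of squares
    km$betweenss / km$totss    # the between_SS / total_SS ratio from the printout
    ```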
    

    If you only want the "best" model, here defined as the one with the best purity index (note: I've never used it, so check the formula), here is another loop:

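    # combinations
    comb <- combn(names(iris[,-5]),2,simplify=FALSE)
    # another list, this time for the scores
    listed_1 <- list()
    
    library(dplyr) # external package to make it simpler
    for (i in seq_along(comb)){
      names_ <- comb[[i]]
      df <- iris[, names_]
      km <- kmeans(df,3)
      # one row per observation: its cluster, its species, and a counter
      df <- data.frame(cl = km$cluster, spec = iris$Species, cnt = 1)
      # count observations per (cluster, species) pair; the count lands in column x
      df <- aggregate(df$cnt, list(cl = df$cl, spec = df$spec), sum)
      # for each species, keep the cluster that holds most of its observations
      df <- df %>% group_by(spec) %>% filter(x == max(x))
      # share of observations in those clusters, as a percentage
      listed_1[[i]] <- round(sum(df$x)/nrow(iris),2)*100
    }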
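
    On the formula: the loop above picks, for each species, its largest cluster. The usual definition of purity goes the other way round: for each cluster take the count of its most frequent class, sum those counts, and divide by the number of observations. Here is a minimal base R sketch of that definition. The helper name `purity`, the seed, and `nstart = 25` are my own assumptions.

    ```r
    # Hypothetical helper: purity = sum over clusters of the majority-class count / N
    purity <- function(clusters, classes) {
      tab <- table(clusters, classes)       # rows: clusters, columns: classes
      sum(apply(tab, 1, max)) / length(clusters)
    }

    set.seed(42)  # assumed seed for repeatability
    km <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3, nstart = 25)
    purity(km$cluster, iris$Species)
    ```

    On well-separated data the two versions often give the same number, but they can differ when one cluster absorbs most of two classes.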
    

    The result is a list. The following commands bind together (cbind) the scores (do.call(rbind, listed_1)) and the variable combinations (do.call(rbind, comb)) into one matrix, then pick the row with the highest score:

    res <- cbind(do.call(rbind, listed_1),do.call(rbind, comb))
    res[which.max(res[,1]),]
    [1] "95"           "Petal.Length" "Petal.Width"
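
    Note that cbind mixes numbers and names, so the whole result becomes a character matrix. If you would rather keep the scores numeric and see all pairs ranked, a data.frame is one option. This is a sketch under my own assumptions: the seed, `nstart = 25`, the column names, and the per-cluster purity formula are not from the code above, so the numbers need not match the 95 shown there.

    ```r
    set.seed(42)  # assumed seed for repeatability
    comb <- combn(names(iris)[-5], 2, simplify = FALSE)

    # One numeric score per pair: per-cluster majority counts divided by N
    scores <- vapply(comb, function(vars) {
      km <- kmeans(iris[, vars], centers = 3, nstart = 25)
      tab <- table(km$cluster, iris$Species)
      sum(apply(tab, 1, max)) / nrow(iris)
    }, numeric(1))

    # Variable names stay character, scores stay numeric
    res_df <- data.frame(
      var1 = vapply(comb, `[`, character(1), 1),
      var2 = vapply(comb, `[`, character(1), 2),
      purity = scores
    )

    # All pairs from best to worst, then the single best one
    res_df[order(res_df$purity, decreasing = TRUE), ]
    res_df[which.max(res_df$purity), ]
    ```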