
Changing cluster labels for comparison purposes


I need help relabeling the outputs of two clustering procedures so that they can be compared more directly.

Suppose that clustering procedure A gives the following vector as output (one cluster label per individual):

clust1 <- c(1, 1, 1, 1, 3, 2, 2, 1, 1, 2, 3, 2, 2)

Clustering algorithm B, on the other hand, returns the following vector:

clust2 <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 2)

As you can see, the two algorithms returned the same clustering, but this is hard to spot when you have hundreds of observations.

Can you help me develop an automatic function (or a piece of code written in a general way) that changes the cluster labels of one or both vectors so that they use the same labels?

My main goal is not to compare the two clusterings. I need code that does the relabeling described above, so please don't answer by saying that I can compare them with a plot or a contingency table.

Thanks in advance!


Solution

  • clust1 <- c(1, 1, 1, 1, 3, 2, 2, 1, 1, 2, 3, 2, 2)
    clust2 <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 2)
    

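    Here is a solution that works as long as both solutions have the same number of clusters. It uses factor() to apply the labels of clust1 to clust2. The labels are paired in order of first appearance (unique() preserves that order), so the result is as intended when the two solutions group the observations identically, as in this example.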

    clust2_re <- 
      factor(clust2,
           levels = unique(clust2),
           labels = unique(clust1)) |> 
      as.character() |> 
      as.numeric()
    
    clust2_re
    #>  [1] 1 1 1 1 3 2 2 1 1 2 3 2 2
    
    all(clust1 == clust2_re)
    #> [1] TRUE
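
    The question also asks about relabeling both vectors. A minimal sketch of that variant, assuming only base R: give every vector labels 1, 2, 3, ... in order of first appearance. canonical_labels() is a hypothetical helper name, not part of the original answer or of any package.

    ```r
    # Hypothetical helper: relabel clusters as 1, 2, 3, ... in the order
    # in which each original label first appears in the vector.
    canonical_labels <- function(x) {
      match(x, unique(x))
    }

    clust1 <- c(1, 1, 1, 1, 3, 2, 2, 1, 1, 2, 3, 2, 2)
    clust2 <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 2)

    clust1_std <- canonical_labels(clust1)
    clust2_std <- canonical_labels(clust2)

    # TRUE when both vectors describe the same grouping of observations
    identical(clust1_std, clust2_std)
    ```

    Note that this changes the labels of clust1 as well (its second cluster to appear, originally labeled 3, becomes 2). Two vectors that describe the same grouping always end up identical, whatever labels they started with.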
    

    Furthermore, igraph has a compare() function that returns a distance between two clustering results (by default the variation of information, method = "vi"), and it also works when the cluster labels differ. Let's add a third clustering that differs from clust2 only in the last value:

    clust3 <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 3)
    

    When two clustering solutions are the same, compare() returns 0:

    library(igraph)
    compare(clust1, clust2)
    #> [1] 0
    

    Whenever the solutions differ, the result is greater than 0:

    compare(clust1, clust3)
    #> [1] 0.4132943
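
    When two solutions differ slightly, as clust1 and clust3 do, pairing labels by order of first appearance can be fragile. One possible alternative, sketched here in base R, maps each label to the reference label it overlaps with most. relabel_by_overlap() is a hypothetical helper, not an igraph function. It assumes numeric labels and vectors of equal length, and because the matching is greedy, two clusters can end up mapped to the same reference label.

    ```r
    # Hypothetical helper: relabel `other` using the labels of `ref`,
    # assigning each cluster in `other` the `ref` label with which it
    # shares the most observations (greedy, not guaranteed one-to-one).
    relabel_by_overlap <- function(ref, other) {
      stopifnot(length(ref) == length(other))
      tab <- table(other, ref)                  # rows: other labels, cols: ref labels
      best <- colnames(tab)[apply(tab, 1, which.max)]
      names(best) <- rownames(tab)              # lookup: other label -> ref label
      as.numeric(best[as.character(other)])     # assumes numeric labels in `ref`
    }

    clust1 <- c(1, 1, 1, 1, 3, 2, 2, 1, 1, 2, 3, 2, 2)
    clust3 <- c(3, 3, 3, 3, 5, 2, 2, 3, 3, 2, 5, 2, 3)

    clust3_re <- relabel_by_overlap(clust1, clust3)

    # Positions where the two solutions still disagree after relabeling
    which(clust1 != clust3_re)
    ```

    After relabeling, the remaining disagreements are the observations that the two procedures actually assigned differently, here only the last one.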