Tags: r, for-loop, cluster-analysis, similarity

Determine cluster similarity / set overlap


What I am trying to do is run two clustering algorithms (let's call them A and B) and then compare the results (e.g. find that the algorithms put 80% of the models into the same clusters). A simple example with 3 models:

cl_A = c(1,2,1) #the result from Algorithm A
cl_B = c(2,1,2) #the result from Algorithm B

What do I hope to get as a solution from this? Well, 100%. Why? Because I can just "rename" the clusters in my head (rename cluster 1 from algorithm B to 2 and cluster 2 to 1) and then find that the clusters are exactly the same.
In other words, the models that are in a cluster together are the same, and that is all we care about (we don't care about the "name" of the cluster and there is no inherent ordering in it).

What about this example?

cl_A = c(1,2,1,3)
cl_B = c(2,1,2,2)

(Note: The lengths of the vectors are always the same for both, but the values can be in different ranges.)
In this case I'd like to get 3/4 as an answer (i.e. "rename" cl_B to c(1,2,1,1) and then say that there are 3 elements where cl_A and cl_B are the same).
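
Spelled out in R (writing the renamed vector by hand), that is the check I have in mind:

cl_A <- c(1, 2, 1, 3)
cl_B_renamed <- c(1, 2, 1, 1)               # cl_B after "renaming" 2 -> 1 and 1 -> 2
sum(cl_A == cl_B_renamed) / length(cl_A)    # 0.75, i.e. 3/4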

Now, I have written a function that I checked by hand for simple examples (i.e. the example above), but I can't shake the feeling that it doesn't work for more complicated examples...
If any of you have ideas and/or solutions, feel free to comment them.

This is my function, but I'll first explain what I do: I pass the "cluster vectors" (the vectors of cluster assignments, cl_A and cl_B in the example above) to it. I then loop through all clusters of the first vector and, for each of them, through all clusters of the second vector, and choose the "best" overlap. Because I want to use each cluster only once (i.e. I can't say I rename all the "1"s to "2"s, but later decide that I'd also like to rename some "1"s to "3"s; then I'd always get a perfect fit), I keep a "taboo list".
And that's basically all there is to it. However, it feels like it's not working 100% correctly, and I was hoping to find some help here. Thanks already!

#'  cluster similarity
#' 
#' calculate how "similar" two methods / clusters are:
#' sum(cluster_similarity(cluster1, cluster2))/length(cluster1)
#' is the % of objects that are in a cluster with the same objects as before
#' 
#' @param cluster_vector_1 the first vector of cluster assignments, e.g. the cluster component of a kmeans() result
#' @param cluster_vector_2 the second vector of cluster assignments
#' @export
#' 
cluster_similarity = function(cluster_vector_1, cluster_vector_2){
  # bookkeeping: which cluster of vector 2 each cluster of vector 1 has claimed,
  # and how many elements that match contributed
  taboo_list_2 <- rep(NA, length(unique(cluster_vector_1)))
  overall_similarity <- rep(NA, length(unique(cluster_vector_1)))
  
  for(i in unique(cluster_vector_1)){
    cl1 <- which(cluster_vector_1 == i)
    # overlap of cluster i from vector 1 with every not-yet-taboo cluster of vector 2
    similarity <- rep(NA, length(unique(cluster_vector_2)))
    for(j in unique(cluster_vector_2)){
      if(!(j %in% taboo_list_2)){
        cl2 <- which(cluster_vector_2 == j)
        similarity[j] <- sum(cl1 %in% cl2)
      }
    }
    # which.max() returns the first maximum, so ties are broken by cluster label;
    # this (and the indexing above) assumes the labels are the integers 1..k
    best_j <- which.max(similarity)
    taboo_list_2[i] <- best_j
    overall_similarity[i] <- max(similarity, na.rm = TRUE)
  }
  return(overall_similarity)
}

An example:

cl_A = c(1,2,1)
cl_B = c(2,1,2)
cluster_similarity(cl_A, cl_B)   # returns c(2, 1): 2 + 1 = 3 of the 3 models match, i.e. 100%

works. But I'm pretty sure some other things don't work...
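
For example, here is a case where (if I traced my own function correctly) a tie makes the greedy matching pick the "wrong" cluster first, so the result comes out much lower than the best possible renaming:

cl_A = c(1,1,1,1,1,1,2,2,2,2,2)
cl_B = c(1,1,1,2,2,2,1,1,1,1,1)
cluster_similarity(cl_A, cl_B)   # c(3, 0), i.e. 3/11
# matching A's cluster 1 with B's cluster 2 and A's cluster 2 with B's
# cluster 1 instead would give 3 + 5 = 8 matching elements, i.e. 8/11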

Edit

There seems to be some confusion as to why I am doing this, so let me try to clarify: I have data (who doesn't nowadays), and I obviously can't say from where exactly, but I thought of a nice analogy: think of several Kaggle case competitions (call them comp_A,comp_B,...).
For each competition you have several participants that hand some results in (call them part_1,...part_n and inp_1,...,inp_n respectively).
Now obviously not all participants hand something in for every competition. Competition A might have hand-ins from participants 1-20, while competition B might only have hand-ins from 1-10 and 20-25.
What I want to do is find out which "participants" are alike.
For example, part_1 is like part_2 and part_10, and so on.

There is no validation set whatsoever (not even a small one), and for each "competition" there are about 20 participants, each with 1 input. These inputs are gigantic (well, 20MB each but it adds up)

My idea was then to cluster the participants (or rather, their input) for each competition, and see which participants are often in the same cluster (for example, if part_1 and part_2 are in the same cluster for comp_A and comp_B and comp_C, maybe they are alike)
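
To make that concrete, here is a rough sketch of the kind of bookkeeping I have in mind, with completely made-up cluster assignments (the participant and competition names are just placeholders):

# cluster_results[[k]] holds, for competition k, a named vector of cluster
# labels for the participants that handed something in there
cluster_results <- list(
  comp_A = c(part_1 = 1, part_2 = 1, part_3 = 2),
  comp_B = c(part_1 = 2, part_2 = 2, part_4 = 1),
  comp_C = c(part_1 = 1, part_3 = 1, part_4 = 2)
)

participants <- sort(unique(unlist(lapply(cluster_results, names))))
together <- matrix(0, length(participants), length(participants),
                   dimnames = list(participants, participants))

# count, for every pair of participants, how often they share a cluster
for (cl in cluster_results) {
  for (p in names(cl)) {
    for (q in names(cl)) {
      if (p != q && cl[p] == cl[q]) {
        together[p, q] <- together[p, q] + 1
      }
    }
  }
}

together   # how often each pair of participants ended up in the same cluster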
And well, I wasn't aware of any theoretical justification for using one clustering method over another, so I let them all run (and without a validation set it's pretty hard to evaluate), and now I want to look at, as @Logister correctly identified, the robustness of each clustering algorithm to decide which one might be best.
I hope that clarifies the background of my question, and I'm always open for more constructive ideas!


Solution

  • This is a topic of cluster validation. There are already functions in R that give you "similarity" values between clusterings, such as the Rand Index and the Adjusted Rand Index. I suggest using them.

    The Adjusted Rand Index is the best approach for measuring agreement between two clusterings.

    ARI measures not only the correct separation of elements belonging to different classes, but also the relation between elements of the same class (who said that).

    The ARI function can be found here.


    The math behind the ARI is not basic at all. Because of this, I suggest checking out the Rand Index measure, which is very simple to understand, and implementing it yourself (there is a small sketch at the end of this answer).


    Note: Your function does not take into account the case where the similarity vector has more than one maximum (a tie). What do you do in that case? I suggest watching this video.
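
    For completeness, here is a minimal pair-counting sketch of the Rand Index (my own toy version, not the function linked above), plus an ARI call via the mclust package, which is just one implementation I happen to know of:

    # Rand Index: fraction of object pairs on which the two clusterings agree
    # ("same cluster" in both, or "different clusters" in both)
    rand_index <- function(cl_1, cl_2) {
      stopifnot(length(cl_1) == length(cl_2))
      pairs <- combn(length(cl_1), 2)             # all pairs of object indices
      same_1 <- cl_1[pairs[1, ]] == cl_1[pairs[2, ]]
      same_2 <- cl_2[pairs[1, ]] == cl_2[pairs[2, ]]
      mean(same_1 == same_2)
    }

    rand_index(c(1,2,1),   c(2,1,2))      # 1: the labels only differ by renaming
    rand_index(c(1,2,1,3), c(2,1,2,2))    # 0.667: 4 of the 6 pairs agree

    # Adjusted Rand Index, e.g. via the mclust package:
    # mclust::adjustedRandIndex(c(1,2,1,3), c(2,1,2,2))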