Tags: r, cluster-analysis, agent-based-modeling

R - find clusters of size 2 (pairs)


I am looking for a way to find clusters of size 2 (pairs). Is there a simple way to do that?

Suppose I have data that I want to match on x and y, for example:

library(cluster)
set.seed(1)

df = data.frame(id = 1:10, x_coord = sample(10,10), y_coord = sample(10,10))

I want to find the pairs of rows that are closest to each other by distance:

d = stats::dist(df[,c(1,2)], diag = T)
h = hclust(d)
plot(h)

I get a dendrogram like the one below. I would like the pairs (9,10), (1,3), (6,7) and (4,5) to be grouped together, and cases 8 and 2, which are left on their own, to be removed.

Maybe there is a more effective alternative for doing this than clustering.

Ultimately, I would like to remove the unmatched ids, keep the pairs, and end up with a dataset like this one:

  id x_coord y_coord pair_id
   1       9       3       1
   3       7       5       1
   4       1       8       2
   5       2       2       2
   6       5       6       3
   7       3      10       3
   9       6       4       4
  10       8       7       4

[Image: dendrogram produced by plot(h)]


Solution

  • You could use the merge element of the hclust result, h$merge. Any row of this two-column matrix in which both values are negative represents a pairing of two singletons. Therefore you can do:

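    # Rows of h$merge with two negative entries merge two single observations
    pairs   <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)), , drop = FALSE]
    # Pair number = row of `pairs` containing the id (assumes id equals row number)
    df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
    # Drop observations that are not part of a singleton pair
    df <- df[!is.na(df$pair),]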
    
    df
    #>    id x_coord y_coord pair
    #> 1   1       9       3    4
    #> 3   3       7       5    4
    #> 4   4       1       8    1
    #> 5   5       2       2    1
    #> 6   6       5       6    2
    #> 7   7       3      10    2
    #> 9   9       6       4    3
    #> 10 10       8       7    3
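
To see why filtering on negative values works, here is a minimal sketch on a made-up five-point example (the `toy` data and the variable names are illustrative assumptions, not part of the question). In `h$merge`, a negative entry refers to a single observation and a positive entry refers to a cluster formed at an earlier step, so a row with two negative entries merges two singletons.

```r
# Made-up data: points 1-2 and 4-5 lie close together, point 3 sits apart.
toy <- data.frame(id = 1:5, x = c(0, 1, 5, 10, 11.5))
h_toy <- hclust(dist(toy$x))

# One row per merge step; negative = single observation, positive = earlier step.
print(h_toy$merge)

# TRUE for the steps that join two single observations.
singleton_rows <- rowSums(h_toy$merge < 0) == 2

# Flip the sign to recover row numbers; drop = FALSE keeps a matrix
# even when only one pair is found.
pairs_toy <- -h_toy$merge[singleton_rows, , drop = FALSE]
print(pairs_toy)
```

In this toy example, observation 3 joins an existing cluster rather than another singleton, so it would be dropped as unpaired.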
    

    Note that the pairs are numbered in order of increasing "height" on the dendrogram, so pair 1 is the pair that was merged first. If you want them numbered in order of first appearance in the data frame instead, you can add the line

    df$pair <- as.numeric(factor(df$pair, levels = unique(df$pair)))
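
As a quick check of what that line does, this sketch applies it to the pair column from the output above (the vector name `pair` is just for illustration). `unique()` lists the labels in order of first appearance, and using that list as the factor levels maps each label to its position in it.

```r
# Pair labels in data frame order, copied from the output above.
pair <- c(4, 4, 1, 1, 2, 2, 3, 3)

# Levels follow first appearance (4, 1, 2, 3), so 4 becomes 1, 1 becomes 2, and so on.
renumbered <- as.numeric(factor(pair, levels = unique(pair)))
print(renumbered)
#> [1] 1 1 2 2 3 3 4 4
```

`match(pair, unique(pair))` should give the same numbering without going through a factor.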
    

    Anyway, if we repeat your plotting code on our newly modified df, we can see there are no unpaired singletons left:

    d = stats::dist(df[,c(1,2)], diag = T)
    h = hclust(d)
    plot(h)
    

    [Image: dendrogram of the filtered df, with no unpaired singletons]

    The same approach also works on a larger data set:

    df = data.frame(id = 1:50, x_coord = sample(50), y_coord = sample(50))
    d = stats::dist(df[,c(1,2)], diag = T)
    h = hclust(d)
    pairs   <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)), , drop = FALSE]
    df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
    df <- df[!is.na(df$pair),]
    d = stats::dist(df[,c(1,2)], diag = T)
    h = hclust(d)
    plot(h)
    

    [Image: dendrogram of the filtered 50-row example]
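
If you need this in more than one place, the steps can be wrapped in a function. The sketch below is one possible way to do it, not part of the original answer: the name `keep_singleton_pairs` and its arguments are hypothetical. Unlike the code above, it assigns pair numbers by row position rather than by matching on `id`, because `h$merge` refers to row positions in the data passed to `dist()`.

```r
# Hypothetical helper: keep only the rows that hclust() merges as singleton pairs.
# `cols` names the columns used for the distance calculation.
keep_singleton_pairs <- function(df, cols) {
  h <- hclust(dist(df[, cols, drop = FALSE]))

  # Merge steps that join two single observations (both entries negative).
  is_pair <- rowSums(h$merge < 0) == 2
  pairs <- -h$merge[is_pair, , drop = FALSE]

  # Row k of `pairs` holds the two row positions that form pair k.
  pair_of_row <- rep(NA_integer_, nrow(df))
  pair_of_row[pairs[, 1]] <- seq_len(nrow(pairs))
  pair_of_row[pairs[, 2]] <- seq_len(nrow(pairs))

  df$pair <- pair_of_row
  df[!is.na(df$pair), , drop = FALSE]
}

# Made-up example with ids that are not 1..n, to show that ids are not used.
toy <- data.frame(id = c(10, 20, 30, 40, 50), x = c(0, 1, 5, 10, 11.5))
res <- keep_singleton_pairs(toy, "x")
print(res)
```

Working with row positions means the function does not rely on `id` running from 1 to `nrow(df)`.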