Search code examples
rcluster-analysishierarchical-clustering

R - How to find closest neighbours from dissimilarity matrix?


I have a dissimilarity matrix (gower.dist) and now I would like to find the closest n neighbours to a certain datapoint (e.g. row number 50). Can anyone help me?

Sample Data https://towardsdatascience.com/hierarchical-clustering-on-categorical-data-in-r-a27e578f2995

#----- Dummy Data -----#
library(dplyr)
set.seed(40)
id.s <- c(1:200) %>%
        factor()
budget.s <- sample(c("small", "med", "large"), 200, replace = T) %>%
            factor(levels=c("small", "med", "large"), 
            ordered = TRUE)
origins.s <- sample(c("x", "y", "z"), 200, replace = T, 
             prob = c(0.7, 0.15, 0.15))
area.s <- sample(c("area1", "area2", "area3", "area4"), 200, 
          replace = T,
          prob = c(0.3, 0.1, 0.5, 0.2))
source.s <- sample(c("facebook", "email", "link", "app"), 200,   
            replace = T,
            prob = c(0.1,0.2, 0.3, 0.4))
dow.s <- sample(c("mon", "tue", "wed", "thu", "fri", "sat", "sun"), 200, replace = T,
         prob = c(0.1, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2)) %>%
         factor(levels=c("mon", "tue", "wed", "thu", "fri", "sat", "sun"), 
        ordered = TRUE)
dish.s <- sample(c("delicious", "the one you don't like", "pizza"), 200, replace = T)

synthetic.customers <- data.frame(id.s, budget.s, origins.s, area.s, source.s, dow.s, dish.s)

#----- Dissimilarity Matrix -----#
library(cluster) 
# to perform different types of hierarchical clustering
# package functions used: daisy(), diana(), clusplot()
gower.dist <- daisy(synthetic.customers[ ,2:7], metric = c("gower"))

Solution

  • Assuming you want the 5 closest neighbors to datapoint 50:

    row_number <- 50
    n <- 5
    
    dists <- unname(as.matrix(gower.dist)[row_number,])
    order(dists)[1:n]
    
    # 50  83 112  60  75
    

    Indeed, number 50 is closest to itself, and the rest are the 4 next closest values.
    The "trick" is to convert your object to a matrix, extract the row you're interested in, and use the base R order function to find the indices of the smallest values in that row.