r cluster-analysis

Hierarchical clustering with Gower distance - hclust() and philentropy::distance()


I've got a mixed data set (categorical and continuous variables) and I'd like to do hierarchical clustering using Gower distance.

I based my code on an example from https://www.r-bloggers.com/hierarchical-clustering-in-r-2/, which uses base R dist() to compute Euclidean distance. Since dist() doesn't compute Gower distance, I tried philentropy::distance() instead, but it fails with an error.

Thanks for any help!

# Data
data("mtcars")
mtcars$cyl <- as.factor(mtcars$cyl)

# Hierarchical clustering with Euclidean distance - works 
clusters <- hclust(dist(mtcars[, 1:2]))
plot(clusters)

# Hierarchical clustering with Gower distance - doesn't work
library(philentropy)
clusters <- hclust(distance(mtcars[, 1:2], method = "gower"))
plot(clusters)

Solution

  • The error comes from the distance() function itself, not from hclust().

    I don't know whether it's intentional, but the current implementation of philentropy::distance() with method = "gower" cannot handle mixed data types. Its first operation is to transpose the data.frame, which produces a character matrix, and that matrix then triggers the type error when it is passed to the DistMatrixWithoutUnit function.
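
    To see the type problem in isolation, here is a minimal sketch that only mimics the transpose step described above using base R. It is not philentropy's actual source code; the variable names x and m are just for illustration.

    ```r
    # Illustrative only: reproduces the transpose step with base R,
    # not philentropy's internal code.
    data("mtcars")
    x <- mtcars[, 1:2]
    x$cyl <- as.factor(x$cyl)

    sapply(x, class)   # one numeric column (mpg), one factor column (cyl)

    m <- t(x)          # t() on a data.frame goes through as.matrix()
    typeof(m)          # the factor column forces the whole matrix to character
    is.numeric(m)      # so any routine expecting a numeric matrix will reject it
    ```

    A matrix can hold only one type, so a single non-numeric column is enough to turn every value into a string.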

    You might try daisy() from the cluster package instead, which accepts mixed column types when metric = "gower".

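    library(cluster)

    # mpg is numeric, cyl is categorical
    x <- mtcars[, 1:2]
    x$cyl <- as.factor(x$cyl)

    # Gower dissimilarities (named gower_dist to avoid masking base dist())
    gower_dist <- daisy(x, metric = "gower")

    cls <- hclust(gower_dist)
    plot(cls)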
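
    As a sanity check on what daisy() returns, the sketch below recomputes the Gower dissimilarity for one pair of cars by hand and then cuts the tree into groups. The pair of cars and k = 3 are arbitrary choices made for illustration, and the hand calculation assumes the usual Gower rules: numeric differences scaled by the column range, and a 0/1 mismatch for factors.

    ```r
    library(cluster)

    data("mtcars")
    x <- mtcars[, 1:2]
    x$cyl <- as.factor(x$cyl)

    gower_dist <- daisy(x, metric = "gower")

    # Hand-computed Gower dissimilarity for one pair of rows:
    # numeric column -> absolute difference divided by the column range,
    # factor column  -> 0 if the levels match, 1 otherwise.
    a <- "Mazda RX4"
    b <- "Datsun 710"
    d_mpg <- abs(x[a, "mpg"] - x[b, "mpg"]) / diff(range(x$mpg))
    d_cyl <- as.numeric(x[a, "cyl"] != x[b, "cyl"])
    manual <- mean(c(d_mpg, d_cyl))

    # Compare with the corresponding entry of the daisy() result
    all.equal(manual, as.matrix(gower_dist)[a, b])

    # Cut the dendrogram into k groups; k = 3 is an arbitrary choice here
    cls <- hclust(gower_dist)
    groups <- cutree(cls, k = 3)
    table(groups)
    ```

    cutree() returns one group label per row of x, so the result can be attached to the data with x$group <- groups if that is useful.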
    

    EDIT: For future reference, it seems that philentropy will be updated to include better type handling in the next version. From the vignette:

    In future versions of philentropy I will optimize the distance() function so that internal checks for data type correctness and correct input data will take less termination time than the base dist() function.