Python equivalent of daisy() in the cluster package of R


I have a dataset that contains both categorical (nominal and ordinal) and numerical attributes. I want to calculate the (dis)similarity matrix across my observations using these mixed attributes. Using the daisy() function of the cluster package in R, I can easily get a dissimilarity matrix as follows:

if(!require("cluster")) { install.packages("cluster");  require("cluster") }
data(flower)
as.matrix(daisy(flower, metric = "gower"))

This uses the Gower metric to handle the nominal variables. Is there a Python equivalent of R's daisy() function?

Failing that, is there any other module or function that can compute the (dis)similarity matrix with the Gower metric, or something similar, for a dataset with mixed (nominal, numeric) attributes?


Solution

  • I believe you are looking for scipy.spatial.distance.pdist.

    If you implement a function that computes the Gower distance for a single pair of observations, you can pass that function to pdist as its metric argument. pdist applies it to every pair of rows and returns the distances as a condensed 1-D vector; scipy.spatial.distance.squareform converts that vector into the full square matrix. The Gower distance does not appear to be one of pdist's built-in metrics.

    Likewise, because each observation has mixed attributes, you can define your own function that, say, uses something like the Euclidean distance on the numerical attributes and a Gower distance on the categorical attributes, then adds the two. Any other definition works as well, as long as it captures what the distance between two isolated observations should mean for your application.

    For clustering in Python, you usually want to work with scikit-learn (formerly scikits.learn). This question and answer page discusses exactly this problem of using a custom distance measure (in your case Gower) with scikit-learn, which does not appear to be possible.

    You could use one of the metrics built into pdist together with the implementation on that linked answer page, or you could implement a function for the Gower similarity and use that. But if you want the out-of-the-box clustering tools from scikit-learn, it does not appear to be directly possible.
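
To make the pdist suggestion concrete, here is a minimal sketch of a Gower-style distance passed to pdist as a custom metric. It is not a port of daisy(): it assumes no missing values, equal feature weights, and categorical columns that have already been encoded as integer codes, and it does not give ordinal variables any special treatment. The helper name make_gower_metric and the toy data are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def make_gower_metric(X, is_categorical):
    """Build a pairwise Gower-style distance for rows of X (hypothetical helper).

    X: 2-D float array; categorical columns hold integer codes.
    is_categorical: boolean mask with one entry per column.
    Assumes no missing values and equal feature weights.
    """
    is_categorical = np.asarray(is_categorical, dtype=bool)
    numeric = ~is_categorical
    # Range of each numeric column, used to scale differences into [0, 1].
    ranges = X[:, numeric].max(axis=0) - X[:, numeric].min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns then contribute 0, not 0/0

    def gower_pair(u, v):
        # Numeric features: range-normalised absolute difference.
        d_num = np.abs(u[numeric] - v[numeric]) / ranges
        # Categorical features: 0 if the codes match, 1 otherwise.
        d_cat = (u[is_categorical] != v[is_categorical]).astype(float)
        # Average the per-feature dissimilarities over all features.
        return (d_num.sum() + d_cat.sum()) / u.size

    return gower_pair


# Toy data: two numeric columns and one nominal column stored as codes.
X = np.array([
    [1.0, 10.0, 0.0],
    [2.0, 20.0, 1.0],
    [3.0, 30.0, 1.0],
])
metric = make_gower_metric(X, is_categorical=[False, False, True])

condensed = pdist(X, metric=metric)  # 1-D vector with n*(n-1)/2 entries
D = squareform(condensed)            # symmetric n x n dissimilarity matrix
```

The closure keeps the column ranges and the categorical mask out of the pairwise call, so pdist only ever sees a function of two rows. D plays the same role as as.matrix(daisy(...)) in the question, though the values will only agree with daisy() when the assumptions above hold.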