
Calculating medoid of a cluster (Python)


So I'm running a KNN in order to create clusters. From each cluster, I would like to obtain the medoid of the cluster.

I'm employing a fractional distance metric in order to calculate distances:

$$\mathrm{dist}_f(x, y) = \left( \sum_{i=1}^{d} \lvert x^i - y^i \rvert^{f} \right)^{1/f}$$

where d is the number of dimensions, the first data point's coordinates are x^i, the second data point's coordinates are y^i, and f is an arbitrary number between 0 and 1.
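
For reference, here is a minimal NumPy sketch of that metric. It assumes the usual form of the fractional distance, (sum over i of |x^i - y^i|^f)^(1/f); the function name `fractional_distance` and the default `f=0.5` are only illustrative choices, not something fixed by the problem.

```python
import numpy as np


def fractional_distance(x, y, f=0.5):
    """Fractional distance between two points.

    Assumed form: (sum_i |x_i - y_i| ** f) ** (1 / f), with 0 < f < 1.
    """
    if not 0 < f < 1:
        raise ValueError("f is expected to lie strictly between 0 and 1")
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Raise each coordinate-wise absolute difference to the power f,
    # sum over the d dimensions, then take the 1/f root of the total.
    return np.sum(np.abs(x - y) ** f) ** (1.0 / f)


# Example: two 2-D points that differ by 1 in each coordinate.
print(fractional_distance([0.0, 0.0], [1.0, 1.0], f=0.5))
```

With f = 0.5 the two unit differences contribute 1 + 1 = 2 inside the sum, and squaring gives 4, so the fractional distance here is larger than the Euclidean one.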

I would then calculate the medoid as:

$$x_{\text{medoid}} = \arg\min_{y \in S} \sum_{x \in S} \delta(x, y)$$

where S is the set of datapoints, and δ is the absolute value of the distance metric used above.

I've looked online to no avail for implementations of the medoid (even with other distance metrics), but most things I found were specifically about k-means or k-medoids, which [I think] is relatively different from what I want.

Essentially this boils down to me being unable to translate the math into effective programming. Any help or pointers in the right direction would be much appreciated! Here's a short list of what I have so far:

  • I have figured out how to calculate the fractional distance metric (the first equation) so I think I'm good there.
  • I know numpy has an argmin() function (documented here).
  • Extra points for increased efficiency without loss of accuracy. I'm trying not to brute-force this by calculating every single pairwise fractional distance, because the number of point pairs might lead to a factorial complexity.

Solution

    1. compute pairwise distance matrix
    2. compute column or row sum
    3. argmin to find medoid index

    i.e. numpy.argmin(distMatrix.sum(axis=0)) or similar.
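
The three steps above can be sketched roughly as follows. This is only one possible way to do it: it assumes the cluster is an `(n, d)` NumPy array and that the distance is the fractional metric from the question in the form (sum over i of |x^i - y^i|^f)^(1/f). The helper names `pairwise_fractional_distances` and `medoid_index` are made up for this example.

```python
import numpy as np


def pairwise_fractional_distances(points, f=0.5):
    """Return the (n, n) matrix of fractional distances between all rows."""
    points = np.asarray(points, dtype=float)
    # Broadcasting: diff[i, j, k] = points[i, k] - points[j, k]
    diff = points[:, None, :] - points[None, :, :]
    return np.sum(np.abs(diff) ** f, axis=-1) ** (1.0 / f)


def medoid_index(points, f=0.5):
    """Index of the point with the smallest total distance to all others."""
    dist_matrix = pairwise_fractional_distances(points, f)
    # The matrix is symmetric, so summing over rows or columns is equivalent.
    return int(np.argmin(dist_matrix.sum(axis=0)))


# Hypothetical cluster of five 2-D points.
cluster = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0]])
idx = medoid_index(cluster)
print(idx, cluster[idx])
```

Note that the number of point pairs is n(n-1)/2, so the cost grows quadratically with the cluster size, not factorially. The broadcasting step does build an intermediate array of shape `(n, n, d)`, so for very large clusters it may be worth computing the matrix in chunks of rows instead.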