
Wrap-around when calculating distance for k-means


I'm trying to do K-means clustering of a dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0-23, so the distance algorithm thinks that 0 is very far from 23, because in absolute terms it is. In reality and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so it computes the more 'real' time difference? I'm doing something simple, similar to the following:

from sklearn.cluster import KMeans
from numpy import vstack

# data is a list of samples like [22, 418, 192] (hour first)
clusters = KMeans(n_clusters=2)
data = vstack(data)
fit = clusters.fit(data)
classes = fit.predict(data)

Each data element looks something like [22, 418, 192], where the first element is the hour.

Any ideas?


Solution

  • Why k-means doesn't work with arbitrary distances

    K-means is not a distance-based algorithm.

    K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).
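
    Written out (this is the standard textbook objective, nothing sklearn-specific), it is

        \mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

    where \mu_k is the mean of cluster C_k.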

    In order for Lloyd's algorithm to converge, both steps need to optimize the same function:

    • the reassignment step
    • the centroid update step

    Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.

    If you plug in an arbitrary other distance function, as suggested by @elyase, k-means may no longer converge.

    Proper solutions

    There are various solutions to this:

    • Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
    • Transform the data into a kernel space where you are happy with minimizing Sum-of-Squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi), which may be okay for SSQ (see the first sketch after this list).
    • Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which is actually just as old as k-means), and then try DBSCAN and its variants (the second sketch below uses DBSCAN with a precomputed wrap-around distance).
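
    A minimal sketch of the sin/cos transformation applied to the question's [hour, feature2, feature3] layout (the sample values are made up, and in practice the remaining columns would likely need scaling so they don't dominate):

        import numpy as np
        from sklearn.cluster import KMeans

        data = np.array([[22, 418, 192],
                         [ 1, 300, 180],
                         [12, 410, 200]])   # [hour, feature2, feature3]

        hour = data[:, 0]
        # replace the hour column by its position on the 24-hour circle
        hour_sin = np.sin(hour / 12 * np.pi)
        hour_cos = np.cos(hour / 12 * np.pi)
        X = np.column_stack([hour_sin, hour_cos, data[:, 1:]])

        clusters = KMeans(n_clusters=2).fit(X)
        classes = clusters.predict(X)

    In the transformed space, hours 23 and 0 sit next to each other on the unit circle instead of being 23 apart, so squared Euclidean distance (and therefore the mean) behaves sensibly again.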
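
    And a sketch of the third option: compute a wrap-around distance yourself and hand it to a distance-based algorithm. DBSCAN with metric='precomputed' is used here as one example; eps and min_samples are placeholder values you would tune, and only the hour dimension is shown:

        import numpy as np
        from sklearn.cluster import DBSCAN

        hours = np.array([0, 1, 23, 11, 12, 13])   # hour-of-day values only

        # pairwise wrap-around distance on a 24-hour clock
        diff = np.abs(hours[:, None] - hours[None, :])
        dist = np.minimum(diff, 24 - diff)

        labels = DBSCAN(eps=2, min_samples=2, metric='precomputed').fit_predict(dist)
        print(labels)   # e.g. [0 0 0 1 1 1]: hours 0, 1, 23 cluster together, as do 11, 12, 13

    The same precomputed distance matrix can also be fed to hierarchical clustering or to a k-medoids implementation.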