Tags: python, python-3.x, pandas, machine-learning, k-means

Find distance between centroid and points in a single feature dataframe - KMeans


I'm working on an anomaly detection task using KMeans.
The pandas DataFrame I'm using has a single feature, and its values look like this:

df = array([[12534.],
           [12014.],
           [12158.],
           [11935.],
           ...,
           [ 5120.],
           [ 4828.],
           [ 4443.]])

I can fit the model and predict cluster labels with the following code:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)
km.fit(df)        # learn the two centroids
km.predict(df)    # cluster label for each row

To identify anomalies, I would like to calculate the distance between each point and its centroid, but with a single-feature DataFrame I'm not sure this is the correct approach.

I found examples that use Euclidean distance for this. One of them is the following:

def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx) ** 2 + (y - cy) ** 2) for (x, y) in data[cluster_labels == i_centroid]]
    return distances

centroids = self.km.cluster_centers_
distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(day_df, cx, cy, i, clusters)
    distances.append({'x': cx, 'y': cy, 'distance': mean_distance})

This code doesn't work for me because it unpacks each centroid into two coordinates (cx, cy). Since my DataFrame has a single feature, my centroids have only one coordinate each:

array([[11899.90692187],
       [ 5406.54143126]])

In this case, what is the correct approach to find the distance between each centroid and the points? Is it possible?

Thank you, and sorry for the trivial question. I'm still learning.


Solution

  • You can use scipy.spatial.distance.cdist to create a distance matrix:

    from scipy.spatial.distance import cdist

    centroids = km.cluster_centers_  # from the fitted KMeans model in the question
    dm = cdist(df, centroids)
    

    This should give you a 2-D array in which each row represents an observation in your original dataset and each column represents a centroid. The entry in row x, column y is the distance between your x-th observation and your y-th cluster centroid. cdist uses Euclidean distance by default, but you can pass other metrics. With only one feature the choice matters little: Euclidean and Manhattan distance, for example, both reduce to the absolute difference |x - c|.
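
As a rough end-to-end sketch of the approach above: the array `X` below contains only the seven rows visible in the question (the elided rows are not filled in), and `n_init`, `random_state`, and the 95th-percentile cutoff are assumptions made for illustration, not part of the original answer. It picks, for every point, the distance to the centroid of the cluster it was assigned to, and checks that this equals the plain absolute difference in the single-feature case.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Only the rows visible in the question; the elided rows are left out.
X = np.array([[12534.], [12014.], [12158.], [11935.],
              [5120.], [4828.], [4443.]])

# n_init and random_state are assumptions added for reproducibility.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)            # cluster index for each row
centroids = km.cluster_centers_       # shape (2, 1): one coordinate per centroid

# Distance matrix: rows are observations, columns are centroids.
dm = cdist(X, centroids)              # shape (n_samples, 2)

# Distance from each point to the centroid of its own cluster.
own_dist = dm[np.arange(len(X)), labels]

# With a single feature, Euclidean distance is just |x - c|.
assert np.allclose(own_dist, np.abs(X[:, 0] - centroids[labels, 0]))

# Hypothetical anomaly rule: flag points farther from their centroid
# than the 95th percentile of all such distances.
threshold = np.percentile(own_dist, 95)
anomalies = X[own_dist > threshold]
print(anomalies)
```

The percentile cutoff is only one possible rule; a fixed distance or a multiple of the per-cluster standard deviation may suit the data better. scikit-learn's `km.transform(X)` also returns the distances from each sample to every cluster center, so it can stand in for `cdist` here.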