I'm working on an anomaly detection task using KMeans.
The pandas DataFrame I'm using has a single feature and looks like this:
df = array([[12534.],
[12014.],
[12158.],
[11935.],
...,
[ 5120.],
[ 4828.],
[ 4443.]])
I'm able to fit and to predict values with the following instructions:
km = KMeans(n_clusters=2)
km.fit(df)
km.predict(df)
To identify anomalies, I would like to calculate the distance between each point and its centroid, but since my DataFrame has a single feature, I'm not sure this is the correct approach.
I found examples which used euclidean distance to calculate the distance. An example is the following one:
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx) ** 2 + (y - cy) ** 2) for (x, y) in data[cluster_labels == i_centroid]]
    return distances

centroids = self.km.cluster_centers_
distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(day_df, cx, cy, i, clusters)
    distances.append({'x': cx, 'y': cy, 'distance': mean_distance})
This code doesn't work for me because it unpacks each centroid into two coordinates (cx, cy), while my centroids have only one, since I have a single-feature DataFrame:
array([[11899.90692187],
[ 5406.54143126]])
In this case, what is the correct approach to find the distance between centroid and points? Is it possible?
Thank you, and sorry for the trivial question; I'm still learning.
You can use scipy.spatial.distance.cdist to create a distance matrix:
from scipy.spatial.distance import cdist
centroids = km.cluster_centers_  # centroids of the fitted KMeans model
dm = cdist(df, centroids)
This should give you a 2-D array in which each row represents an observation in your original dataset and each column represents a centroid. The entry in the x-th row and y-th column is the distance from your x-th observation to your y-th cluster centroid. cdist uses Euclidean distance by default, but you can use other metrics (not that it matters much for a dataset with only one feature).
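As a rough sketch of how the distance matrix could be turned into anomaly flags: the snippet below uses only the sample values and centroids printed in the question (the elided rows are left out), so the numbers are illustrative. The names X, dist_to_own and is_anomaly, as well as the 90th-percentile cutoff, are assumptions made for the example and not part of the original code; pick a threshold that suits your data.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Sample values copied from the question (the elided rows are omitted).
X = np.array([[12534.], [12014.], [12158.], [11935.],
              [5120.], [4828.], [4443.]])

# Centroids as printed in the question; with a fitted model,
# use km.cluster_centers_ instead.
centroids = np.array([[11899.90692187], [5406.54143126]])

# dm[i, j] is the Euclidean distance from observation i to centroid j.
dm = cdist(X, centroids)

# With a single feature, Euclidean distance is just the absolute difference.
assert np.allclose(dm, np.abs(X - centroids.T))

# Index of the nearest centroid for each point, and the distance to it.
labels = dm.argmin(axis=1)
dist_to_own = dm.min(axis=1)

# Hypothetical rule: flag points whose distance to their own centroid
# exceeds the 90th percentile of all such distances.
threshold = np.percentile(dist_to_own, 90)
is_anomaly = dist_to_own > threshold
```

For a fitted KMeans model, labels should match km.predict(df), since predict assigns each point to its nearest centroid. scikit-learn's km.transform(df) also returns the distances to every cluster center, so it can stand in for cdist if you prefer to stay within scikit-learn.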