I was working my way through An Introduction to Outlier Analysis by Charu Aggarwal and doing Exercise 7 from Chapter 1.
I am trying to use k-Means Clustering to identify an outlier in my data. My approach was to create two clusters and measure each data point's distance to its cluster center in order to determine which points are outliers.
Here's the histogram for my data (generated with Matlab):
Here's the code that I used to create the histogram (with apologies for the fact that that bit's in Matlab rather than Python):
x = [1, 2, 3, 10, 10, 9, 10];
y = [9, 9, 9, 10, 3, 1, 2];
histogram2(x, y)
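(For reference, a rough Python equivalent of that histogram using matplotlib's hist2d; the bin count here is an arbitrary choice on my part, not something from the book:)
import matplotlib.pyplot as plt

x = [1, 2, 3, 10, 10, 9, 10]
y = [9, 9, 9, 10, 3, 1, 2]

# 2-D histogram of the (x, y) pairs, roughly mirroring Matlab's histogram2
plt.hist2d(x, y, bins=10)
plt.colorbar()
plt.show()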
Based on the histogram, I believe that the only outlier is (10, 10).
I tried the following code:
from sklearn.cluster import KMeans
import numpy as np
import math
data = np.array([(1, 9), (2, 9), (3, 9), (10, 10), (10, 3), (9, 1), (10, 2)])
kmeans = KMeans(n_clusters = 2, random_state = 0).fit(data)
print(kmeans.predict(data))
# Based on my understanding of the documentation, this should give a matrix
# of the distances to the centroids for each data point
temp = kmeans.transform(data)
# Calculate Euclidean Distance
for x, y in temp:
    print('x: ' + str(x) + ' y: ' + str(y) + ' dist: ' + str(math.sqrt((x * x) + (y * y))))
Again, based on the histogram I was expecting a single outlier. However, my code prints the following:
x: 10.077822185373186 y: 1.0 dist: 10.127314550264547
x: 9.222933372848358 y: 0.0 dist: 9.222933372848358
x: 8.40014880820572 y: 1.0 dist: 8.459462157844314
x: 6.005206074732157 y: 8.06225774829855 dist: 10.05298463144155
x: 1.0307764064044151 y: 10.0 dist: 10.05298463144155
x: 3.092329219213245 y: 10.63014581273465 dist: 11.070794912742265
x: 2.0155644370746373 y: 10.63014581273465 dist: 10.819542504191201
(Before anyone points this out: yes, I do realize that x and y are used in a somewhat misleading way here.) I was expecting that the point corresponding to (10, 10) would have a significantly higher Euclidean distance from the center than the rest of the points, but all of the distances are quite close to each other.
I'm slightly baffled as to what my mistake is. I'm somewhat suspicious of the last step (with the Euclidean Distance). Is this the correct procedure to use here? Or am I missing some other mistake?
Also, is it reasonable to expect that k-Means Clustering would reveal that outlier in the first place?
The interpretation of kmeans.transform isn't quite correct. It does return each point's distance to every cluster center, but taking the 2-norm of that whole row combines the distances to both centers rather than measuring how far the point is from its own (closest) center, which is why all of the results look so similar. I think the question is really asking "How do I calculate the distance to the closest cluster centroid?", which can be done through the kmeans.cluster_centers_ attribute.
For example, the following prints, for each point in the original data, the minimum Euclidean distance (2-norm) to the two cluster centers found by kmeans:
>>> for a in data:
...     print(a, np.min([np.linalg.norm(a - kmeans.cluster_centers_[0]), np.linalg.norm(a - kmeans.cluster_centers_[1])]))
...
[1 9] 1.0
[2 9] 0.0
[3 9] 1.0
[10 10] 6.005206074732157
[10 3] 1.0307764064044151
[9 1] 3.092329219213245
[10 2] 2.0155644370746373
From this, we can see that the "inliers" all fall within 0-4 units of their nearest center, while the "outlier" sits around 6.0 units away.
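If you'd rather not loop, the same minimum distances can be read off kmeans.transform directly (it already holds each point's distance to every center), and a simple cutoff then flags the outlier. The threshold of 4 units below is just an illustrative choice based on the gap in the numbers above, not a general rule:
# Row-wise minimum of the distances to the two cluster centers
min_dists = np.min(kmeans.transform(data), axis=1)

# Hypothetical cutoff of 4 units, chosen by eye from the distances above
threshold = 4.0
print(data[min_dists > threshold])  # only (10, 10) exceeds the cutoff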
(Edit: Missed the second question)
Also, is it reasonable to expect that k-Means Clustering would reveal that outlier in the first place?
Here: Yes. In general: it depends on whether the 2-norm is a good metric within your space, and (somewhat) on the number of dimensions you're working with (in high-dimensional space, distances tend to concentrate, so everything ends up roughly equally far from everything else and distance-based outliers become harder to spot).
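As a quick, purely illustrative sketch of that last point: for uniform random data, the ratio of the nearest to the farthest distance from a random query point creeps toward 1 as the dimension grows, i.e. everything starts to look roughly equally far away.
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 100):
    pts = rng.random((1000, dim))    # 1000 uniform points in [0, 1]^dim
    query = rng.random(dim)          # a random query point
    dists = np.linalg.norm(pts - query, axis=1)
    # The closer this ratio is to 1, the less contrast there is between
    # "near" and "far" points, so distance-based outliers are harder to find.
    print(dim, dists.min() / dists.max())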