Does anyone know a simple algorithm in Python / PySpark to detect outliers in K-means clustering and to create a list or DataFrame of those outliers? I'm also not sure how to obtain the centroids. I am using the following code:
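from pyspark.ml.clustering import KMeans

n_clusters = 10
kmeans = KMeans(k=n_clusters, seed=0)
# Data is assumed to be a DataFrame with a vector column named "features"
model = kmeans.fit(Data.select("features"))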
model.clusterCenters() will give you the centroids.
To get the outliers, a straightforward way is to look for clusters of size 1: a point that sits far from all the others tends to end up in a cluster by itself.
Example:
data.show()
+-------------+
|     features|
+-------------+
|    [0.0,0.0]|
|    [1.0,1.0]|
|    [9.0,8.0]|
|    [8.0,9.0]|
|[100.0,100.0]|
+-------------+
from pyspark.ml.clustering import KMeans

kmeans = KMeans()  # k defaults to 2
model = kmeans.fit(data)
model.summary.predictions.show()
+-------------+----------+
|     features|prediction|
+-------------+----------+
|    [0.0,0.0]|         0|
|    [1.0,1.0]|         0|
|    [9.0,8.0]|         0|
|    [8.0,9.0]|         0|
|[100.0,100.0]|         1|
+-------------+----------+
print(model.clusterCenters())
[array([4.5, 4.5]), array([100., 100.])]
print(model.summary.clusterSizes)
[4, 1]
# Get outliers with cluster size = 1
import pyspark.sql.functions as F

model.summary.predictions.filter(
    F.col('prediction').isin(
        [cluster_id for (cluster_id, size) in enumerate(model.summary.clusterSizes) if size == 1]
    )
).show()
+-------------+----------+
|     features|prediction|
+-------------+----------+
|[100.0,100.0]|         1|
+-------------+----------+
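
The size-1 rule only catches points that end up alone in a cluster (and such a point sits exactly on its own centroid). A complementary heuristic, which is not part of the approach above, is to flag points that lie far from the centroid of the cluster they were assigned to. Here is a minimal NumPy sketch of that idea. It assumes the features, predictions, and centers have already been collected to the driver (for example from model.summary.predictions and model.clusterCenters()). The sample data, the function name distance_outliers, and the threshold of 3.0 are all made up for illustration.

```python
import numpy as np

def distance_outliers(features, predictions, centers, threshold):
    """Hypothetical helper: indices of points farther than `threshold`
    from the centroid of their assigned cluster, plus all distances."""
    features = np.asarray(features, dtype=float)
    predictions = np.asarray(predictions, dtype=int)
    centers = np.asarray(centers, dtype=float)
    # Euclidean distance from each point to its own cluster's centroid
    distances = np.linalg.norm(features - centers[predictions], axis=1)
    return np.flatnonzero(distances > threshold), distances

# Made-up stand-ins for the collected "features" and "prediction" columns
features = [[0, 0], [1, 1], [0, 1], [1, 0], [4, 4], [9, 8], [8, 9]]
predictions = [0, 0, 0, 0, 0, 1, 1]
# Made-up stand-in for model.clusterCenters() (the per-cluster means)
centers = [[1.2, 1.2], [8.5, 8.5]]

outlier_idx, distances = distance_outliers(features, predictions, centers, threshold=3.0)
print(outlier_idx)  # index 4, the point [4, 4]
```

Here the point [4, 4] shares a cluster with four other points, so the size-1 rule would miss it, but it is about 3.96 away from its centroid and gets flagged. In practice the threshold would have to be chosen from the data (for example a high percentile of the distances), and for large data the same computation would need to run in Spark rather than on the driver.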