
Obtaining k-means centroids and outliers in Python / PySpark


Does anyone know a simple algorithm in Python / PySpark to detect outliers in k-means clustering and to create a list or data frame of those outliers? I'm also not sure how to obtain the centroids. I am using the following code:

from pyspark.ml.clustering import KMeans

n_clusters = 10

kmeans = KMeans(k=n_clusters, seed=0)
model = kmeans.fit(Data.select("features"))

Solution

  • model.clusterCenters() will give you the centroids.
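
    If you want the centroids as a small data frame rather than a list of NumPy arrays, one way is the sketch below (it assumes an active SparkSession named spark, which the original code does not show):

    # Hypothetical helper: pair each cluster id with its centroid coordinates.
    centers_df = spark.createDataFrame(
        [(i, c.tolist()) for i, c in enumerate(model.clusterCenters())],
        ["cluster_id", "centroid"])
    centers_df.show()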

    To get the outliers, a straightforward heuristic is to look at the cluster sizes: a point that ends up alone in a cluster of size 1 is a good outlier candidate.

    Example:
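
    The toy data frame below can be built as follows (a sketch; the answer only shows the data's contents, so this constructor is an assumption):

    from pyspark.ml.linalg import Vectors

    # Five 2-D points; the last one sits far away from the rest.
    data = spark.createDataFrame(
        [(Vectors.dense(v),) for v in
         ([0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0], [100.0, 100.0])],
        ["features"])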

    data.show()
    +-------------+
    |     features|
    +-------------+
    |    [0.0,0.0]|
    |    [1.0,1.0]|
    |    [9.0,8.0]|
    |    [8.0,9.0]|
    |[100.0,100.0]|
    +-------------+
    
    from pyspark.ml.clustering import KMeans
    kmeans = KMeans()  # the default k=2 is enough for this toy example
    model = kmeans.fit(data)
    model.summary.predictions.show()
    +-------------+----------+
    |     features|prediction|
    +-------------+----------+
    |    [0.0,0.0]|         0|
    |    [1.0,1.0]|         0|
    |    [9.0,8.0]|         0|
    |    [8.0,9.0]|         0|
    |[100.0,100.0]|         1|
    +-------------+----------+
    
    print(model.clusterCenters())
    [array([4.5, 4.5]), array([100., 100.])]
    
    print(model.summary.clusterSizes)
    [4, 1]
    
    # Outliers = points whose cluster contains only themselves (size 1)
    import pyspark.sql.functions as F

    singleton_clusters = [
        cluster_id
        for cluster_id, size in enumerate(model.summary.clusterSizes)
        if size == 1
    ]
    model.summary.predictions.filter(
        F.col('prediction').isin(singleton_clusters)
    ).show()
    +-------------+----------+
    |     features|prediction|
    +-------------+----------+
    |[100.0,100.0]|         1|
    +-------------+----------+
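
    Note that filtering on cluster size 1 only catches points that land in a cluster of their own. A complementary heuristic is to score every point by its Euclidean distance to its assigned centroid and flag the farthest ones. The sketch below reuses the model and imports from the example above; the mean + 3 * stddev cutoff is an arbitrary illustration, not part of the original answer:

    import numpy as np
    from pyspark.sql.types import DoubleType

    centers = model.clusterCenters()

    # Distance from each point to the centroid of its assigned cluster.
    dist_to_centroid = F.udf(
        lambda features, prediction: float(
            np.linalg.norm(features.toArray() - centers[prediction])),
        DoubleType())

    scored = model.summary.predictions.withColumn(
        "distance", dist_to_centroid("features", "prediction"))

    # Flag points farther than mean + 3 * stddev from their centroid
    # (an arbitrary cutoff chosen for illustration).
    stats = scored.agg(F.mean("distance").alias("mu"),
                       F.stddev("distance").alias("sigma")).first()
    scored.filter(F.col("distance") > stats["mu"] + 3 * stats["sigma"]).show()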