
PySpark: how to count the number of elements in each cluster of a KMeans model?


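I trained a KMeans model:

from pyspark.ml.clustering import KMeans

# df must contain the vector column that KMeans reads (default name: "features")
kmeans = KMeans(k=20, seed=1)
df.show()
kmeans_model = kmeans.fit(df)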

I just want to count how many elements are in each cluster, but I can't find a simple way to do it.


Solution

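  • I checked the PySpark documentation. The trained model exposes a summary, and its clusterSizes attribute holds the number of points in each cluster:

    summary = kmeans_model.summary
    # clusterSizes is a list; entry i is the number of training points in cluster i
    print(summary.clusterSizes)
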

    Reference:

    http://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans
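
If the summary is not available (for example, as far as I know a model loaded back from disk has no training summary) or you want counts for a DataFrame other than the training data, the sizes can also be computed by assigning each row to a cluster and grouping on the prediction column. The following is only a sketch: cluster_sizes and counts_to_sizes are hypothetical helper names, not part of PySpark, and it assumes the model uses the default prediction column name "prediction".

```python
def counts_to_sizes(pairs, k):
    """Turn (cluster_index, count) pairs into a list of length k.

    Clusters that received no rows do not show up in a groupBy result,
    so they are filled in with 0 here.
    """
    sizes = [0] * k
    for cluster, count in pairs:
        sizes[cluster] = count
    return sizes


def cluster_sizes(model, df, k, prediction_col="prediction"):
    """Count the rows of df assigned to each of the k clusters.

    Assumes model is a fitted pyspark.ml KMeansModel and df has the
    features column the model expects (hypothetical helper, not a
    PySpark API).
    """
    # transform() adds the prediction column; groupBy().count() adds "count"
    rows = model.transform(df).groupBy(prediction_col).count().collect()
    return counts_to_sizes(((r[prediction_col], r["count"]) for r in rows), k)
```

With the model from the question this would be called as cluster_sizes(kmeans_model, df, k=20). When df is the training data, the result should match summary.clusterSizes, at the cost of an extra pass over the data.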