Tags: python, machine-learning, pyspark, cluster-analysis, k-means

PySpark: AttributeError: 'PipelineModel' object has no attribute 'clusterCenters'


I built a k-means model with PySpark and now want to extract the cluster centers. How do I get them when the model is fitted through a pipeline? This is the code I have so far, but it throws "AttributeError: 'PipelineModel' object has no attribute 'clusterCenters'". How can it be fixed?

#### model K-Means ###

from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans, KMeansModel

kmeans = KMeans() \
          .setK(3) \
          .setFeaturesCol("scaledFeatures")\
          .setPredictionCol("cluster")

# Wrap the KMeans estimator in a Pipeline
pipeline = Pipeline(stages=[kmeans])

model = pipeline.fit(matrix_normalized)

cluster = model.transform(matrix_normalized)

# Get cluster centers (this is the line that raises the AttributeError)
centers = model.clusterCenters()

Solution

  • Dummy data to reproduce the setup:

    from pyspark.ml.linalg import Vectors
    from pyspark.ml.clustering import KMeans, KMeansModel
    from pyspark.ml.pipeline import Pipeline
    
    
    data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
            (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
    matrix_normalized = spark.createDataFrame(data, ["scaledFeatures"])
    

  • Your code, unchanged:

    kmeans = KMeans() \
              .setK(3) \
              .setFeaturesCol("scaledFeatures")\
              .setPredictionCol("cluster")
    
    # Wrap the KMeans estimator in a Pipeline
    pipeline = Pipeline(stages=[kmeans])
    
    model = pipeline.fit(matrix_normalized)
    
    cluster = model.transform(matrix_normalized)
    

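  • Change only the last line. clusterCenters() is a method of the fitted KMeansModel, not of the PipelineModel that wraps it, so fetch the KMeansModel from the pipeline's stages first: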

    model.stages[0].clusterCenters()
    
    [array([0.5, 0.5]), array([8., 9.]), array([9., 8.])]
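
The index 0 works here because KMeans is the only stage. If the pipeline later gains earlier stages (a scaler, for example), the index shifts. A minimal sketch of one way around that is to look the stage up by type instead. Note that first_stage_of_type is a hypothetical helper written for this answer, not part of PySpark; it relies only on the stages list that a fitted PipelineModel exposes.

```python
def first_stage_of_type(pipeline_model, stage_type):
    """Return the first fitted stage that is an instance of stage_type.

    Hypothetical helper (not a PySpark API). It only assumes that
    pipeline_model has a `stages` list, as a fitted PipelineModel does.
    """
    for stage in pipeline_model.stages:
        if isinstance(stage, stage_type):
            return stage
    # Fail loudly instead of returning None, so a missing stage is obvious.
    raise ValueError(f"no stage of type {stage_type.__name__} in pipeline")
```

With the model fitted above and KMeansModel imported from pyspark.ml.clustering, first_stage_of_type(model, KMeansModel).clusterCenters() should return the same centers as model.stages[0].clusterCenters().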