python, apache-spark, pyspark, cluster-analysis, k-means

'KMeansModel' object has no attribute 'computeCost' in apache pyspark


I'm experimenting with a clustering model in PySpark. I'm trying to get the mean squared cost of the cluster fit for different values of k:

from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

def meanScore(k, df):
  # assemble the first 38 columns into a feature vector, fit k-means,
  # and return the mean squared cost (total cost / row count)
  inputCols = df.columns[:38]
  assembler = VectorAssembler(inputCols=inputCols, outputCol="features")
  kmeans = KMeans().setK(k)
  pipeModel2 = Pipeline(stages=[assembler, kmeans])
  kmeansModel = pipeModel2.fit(df).stages[-1]
  return kmeansModel.computeCost(assembler.transform(df)) / df.count()

When I call this function to compute the cost for different values of k on the DataFrame:

for k in range(20,100,20):
  sc = meanScore(k,numericOnly)
  print((k,sc))

I receive an attribute error: AttributeError: 'KMeansModel' object has no attribute 'computeCost'

I'm fairly new to PySpark and am just learning; I sincerely appreciate any help with this. Thanks.


Solution

  • computeCost is deprecated as of Spark 3.0.0 (and has since been removed from PySpark's KMeansModel, which is why you get the AttributeError). The docs suggest using the evaluator instead; see the sketch after the quoted note.

    Note: Deprecated in 3.0.0. It will be removed in future versions.
    Use ClusteringEvaluator instead. You can also get the cost on the training dataset in the summary.
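
As a minimal sketch of both suggestions, assuming Spark 2.4 or later and the pipeline from the question (pipeModel2 and df are the question's names, hoisted out of meanScore for illustration; fitted, wssse, and the other names are just illustrative):

from pyspark.ml.evaluation import ClusteringEvaluator

fitted = pipeModel2.fit(df)          # PipelineModel built from the question's pipeline
kmeansModel = fitted.stages[-1]

# Option 1: the training cost (sum of squared distances to the centers)
# is available on the model summary, so the question's metric becomes:
wssse = kmeansModel.summary.trainingCost
meanCost = wssse / df.count()

# Option 2: score the clustering with ClusteringEvaluator. Its default
# metric is the silhouette (higher is better, in [-1, 1]), which is not
# the same quantity as the squared cost:
evaluator = ClusteringEvaluator()    # defaults: featuresCol="features", predictionCol="prediction"
silhouette = evaluator.evaluate(fitted.transform(df))

When sweeping k, pick one metric and compare runs with it consistently: summary.trainingCost is the drop-in replacement for computeCost on the training data, while the silhouette is on a different scale entirely.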