apache-spark, machine-learning, pyspark, cluster-analysis, k-means

K-means clustering algorithm in pyspark: syntax for defining the initial seed


I am analysing a k-means clustering algorithm in pyspark and I have a question about the syntax. This is the relevant part of the code:

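from pyspark.ml.clustering import KMeans
from pyspark.ml.clustering import KMeansModel
import numpy as np
# k is the number of clusters; seed is the parameter being asked about.
# featuresCol must name the vector column to cluster on (the default is "features").
kmeans_modeling = KMeans(k = 5, seed = 0, featuresCol = "parameters")
# data is the input DataFrame, defined elsewhere
model = kmeans_modeling.fit(data.select("parameters"))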

What does seed = 0 mean? Surely it cannot mean that all the clusters are initialized at the same point, or we wouldn't obtain distinct clusters, right?


Solution

  • According to the docs, this seed parameter is indeed a random seed, as suggested in the comments. It does not set the position of any cluster center. It initializes the (pseudo)random number generator, so that the generator produces the same sequence on every run. That makes your machine learning run reproducible, provided that the rest of the input is the same as well.

    If you're looking for cluster initialization options, those are in the docs too. There are two: initMode = "random", which chooses random points as the initial cluster centers, and initMode = "k-means||", a parallel variant of k-means++, which is the default.
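
To make the role of the seed concrete, here is a minimal sketch in plain Python. It is not Spark's implementation: the helper pick_initial_centers and the sample points are hypothetical, and it only mimics the idea behind random initialization, assuming the centers are drawn from the data points without replacement.

```python
import random


def pick_initial_centers(points, k, seed):
    """Pick k distinct starting centers with a generator seeded by `seed`.

    Hypothetical helper for illustration only, not part of pyspark.
    """
    # The seed sets the state of the random number generator.
    # It is not a coordinate and says nothing about where a center lies.
    rng = random.Random(seed)
    # sample() draws without replacement, so the k centers are different points.
    return rng.sample(points, k)


# Made-up 2D data points, all distinct.
points = [
    (0.0, 0.0), (1.0, 0.5), (5.0, 5.0), (5.5, 4.5),
    (9.0, 1.0), (8.5, 0.5), (3.0, 8.0),
]

first = pick_initial_centers(points, k=3, seed=0)
second = pick_initial_centers(points, k=3, seed=0)

print(first == second)       # True: same seed and same data give the same centers
print(len(set(first)) == 3)  # True: the centers are three different points
```

With a different seed the draw may come out differently, which is why fixing seed = 0 (or any other integer) is what makes a run repeatable. In pyspark itself, the initialization strategy is chosen separately through initMode, for example KMeans(k = 5, seed = 0, initMode = "random").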