StreamingKMeans setSeed()


I need to train StreamingKMeans with a specific value for seed. When I run

val km = new StreamingKMeans(3, 1.0, "points")
km.setRandomCenters(10, 0.5)
val newmodel = km.latestModel.update(featureVectors, 1.0, "points")

val prediction3 = id_features.map(x=> (x._1, newmodel.predict(x._2)))

it works fine. But when I try to use setSeed:

km.setRandomCenters(10, 0.5).setSeed(6250L)

I am getting an error:

value setSeed is not a member of org.apache.spark.mllib.clustering.StreamingKMeans

How can I set the seed in this case?


Solution

  • The error is telling you that there is no setSeed member of org.apache.spark.mllib.clustering.StreamingKMeans (which you can verify from the API docs; oddly, this method does exist for the KMeans class, but not for StreamingKMeans).

    However, all is not lost... ;-)

    The setRandomCenters method takes 3 parameters, the third being the random seed. Its value defaults to Utils.random.nextLong. To do what you want, change that line from:

    km.setRandomCenters(10, 0.5).setSeed(6250L)
    

    to:

    km.setRandomCenters(10, 0.5, 6250L)
    

    UPDATE: Incidentally, as far as I can tell from the Spark source, the setters on StreamingKMeans (setRandomCenters included) update the instance in place and return that same instance, which is why the calls can be chained. Your original code, which calls km.setRandomCenters(...) on its own line and ignores the return value, therefore still initializes the centers; that matches your observation that it works fine.

    If you prefer, you can chain the call onto the constructor, which is a little more compact:

    val km = new StreamingKMeans(3, 1.0, "points").setRandomCenters(10, 0.5, 6250L)
    
    val newmodel = km.latestModel.update(featureVectors, 1.0, "points")
    
    val prediction3 = id_features.map(x=> (x._1, newmodel.predict(x._2)))
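
To check that the third argument to setRandomCenters really pins the initial centers, here is a minimal sketch that builds the model twice with the same seed and compares the results. The names SeedCheck and initialCenters are made up for illustration. It assumes spark-mllib is on the classpath and, as far as I know, needs no SparkContext because it only constructs the initial model.

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans

object SeedCheck {
  // Initial centers for k = 3 clusters in 10 dimensions with weight 0.5,
  // converted to plain Scala sequences so they can be compared with ==.
  // (Hypothetical helper, not part of the Spark API.)
  def initialCenters(seed: Long): Seq[Seq[Double]] =
    new StreamingKMeans(3, 1.0, "points")
      .setRandomCenters(10, 0.5, seed)
      .latestModel()
      .clusterCenters
      .map(_.toArray.toSeq)
      .toSeq

  def main(args: Array[String]): Unit = {
    val first = initialCenters(6250L)
    val second = initialCenters(6250L)
    // Reports whether two runs with the same seed produced identical centers.
    println(s"same seed gives same centers: ${first == second}")
  }
}
```

If the comparison holds on your Spark version, the starting point is reproducible. The centers will still move once you feed data through update or trainOn, so identical results afterwards also depend on feeding the same data in the same order.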