I need to train StreamingKMeans with a specific value for seed. When I run
val km = new StreamingKMeans(3, 1.0, "points")
km.setRandomCenters(10, 0.5)
val newmodel = km.latestModel.update(featureVectors, 1.0, "points")
val prediction3 = id_features.map(x=> (x._1, newmodel.predict(x._2)))
it works fine. But when I try to use setSeed:
km.setRandomCenters(10, 0.5).setSeed(6250L)
I am getting an error:
value setSeed is not a member of org.apache.spark.mllib.clustering.StreamingKMeans
How can I set the seed in this case?
The error is telling you that there is no setSeed member of org.apache.spark.mllib.clustering.StreamingKMeans (which you can verify from the API docs; oddly, this method does exist for the KMeans class, but not for StreamingKMeans).
However, all is not lost... ;-)
The setRandomCenters method takes 3 parameters, with the third being the random seed. Its value defaults to Utils.random.nextLong. To do what you want, you should change that line from:
km.setRandomCenters(10, 0.5).setSeed(6250L)
to:
km.setRandomCenters(10, 0.5, 6250L)
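If you want to confirm that the seed really pins down the initial centers, a minimal sketch along these lines should do it. It assumes spark-mllib is on the classpath; the object name SeededCentersCheck and the helper initialCenters are made up for illustration and are not part of Spark. As far as I can tell, no SparkContext is needed, because setRandomCenters only builds the initial model locally.

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector

// Hypothetical helper object for illustration, not part of Spark.
object SeededCentersCheck {
  // Build a StreamingKMeans with k = 3 clusters in 10 dimensions
  // and return the randomly generated initial centers for the given seed.
  def initialCenters(seed: Long): Array[Vector] =
    new StreamingKMeans(3, 1.0, "points")
      .setRandomCenters(10, 0.5, seed)
      .latestModel()
      .clusterCenters

  def main(args: Array[String]): Unit = {
    val first  = initialCenters(6250L)
    val second = initialCenters(6250L)
    // MLlib vectors compare by value, so sameElements checks center by center.
    println(s"same seed, same centers: ${first.sameElements(second)}")
  }
}
```

If the two arrays match, repeated runs with the same seed start from the same centers, which is what you are after.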
UPDATE: Incidentally, Spark leans heavily on the functional programming paradigm, so many of its operations (RDD transformations, for example) return a new object rather than modifying the one they are called on. As far as I can tell from the source, the setters on StreamingKMeans are not in that category: setRandomCenters is declared to return this.type, so it updates the instance and hands that same instance back. Your original code therefore does not lose the change, but the return value lets you chain the calls, which keeps the configuration in one place. Something like this:
val km = new StreamingKMeans(3, 1.0, "points").setRandomCenters(10, 0.5, 6250L)
val newmodel = km.latestModel.update(featureVectors, 1.0, "points")
val prediction3 = id_features.map(x=> (x._1, newmodel.predict(x._2)))
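If you are unsure whether a setter changes the instance or returns a fresh one in the Spark version you are running, comparing references settles it. A small sketch, assuming spark-mllib is on the classpath; SetterReturnCheck and returnsSameInstance are names I made up for illustration:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans

// Hypothetical helper object for illustration, not part of Spark.
object SetterReturnCheck {
  // True when setRandomCenters hands back the very object it was called on.
  def returnsSameInstance(): Boolean = {
    val km = new StreamingKMeans(3, 1.0, "points")
    val returned = km.setRandomCenters(10, 0.5, 6250L)
    // eq is reference equality: same object in memory, not just equal contents.
    returned eq km
  }

  def main(args: Array[String]): Unit =
    println(s"setter returned the same instance: ${returnsSameInstance()}")
}
```

If it prints true, the setter returned the object it was called on, and whether you chain the call or write it on its own line is purely a matter of style.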