Tags: scala, apache-spark, pca

How to save a PCA object in Spark Scala?


I'm doing PCA on my data, following the guide at: https://spark.apache.org/docs/latest/mllib-dimensionality-reduction

The relevant code is the following:

import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = sc.parallelize(Seq(
  new LabeledPoint(0, Vectors.dense(1, 0, 0, 0, 1)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 1, 0)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 0, 0)),
  new LabeledPoint(0, Vectors.dense(1, 0, 0, 0, 0)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 0, 0))))

// Compute the top 5 principal components.
val pca = new PCA(5).fit(data.map(_.features))

// Project vectors to the linear space spanned by the top 5 principal
// components, keeping the label
val projected = data.map(p => p.copy(features = pca.transform(p.features)))

This code performs PCA on the data. However, I can't find example code or documentation explaining how to save and load the fitted PCA object for future use. Could someone give me an example based on the above code?


Solution

  • It seems that the RDD-based (mllib) version of PCA does not support saving the fitted model to disk. As a workaround, you can persist the pc matrix of the resulting PCAModel instead. Alternatively, use the Spark ML (DataFrame-based) version: there PCA is an Estimator, and the fitted PCAModel can be saved, reloaded, and included in a Spark ML Pipeline. Both approaches are sketched below.
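
A minimal sketch of the first workaround, assuming the pca model fitted above and the same SparkContext sc; the save path is just an example:

// The RDD-based PCAModel exposes the principal-components matrix as `pc`
// (a serializable DenseMatrix), which can be written out as an object file.
sc.parallelize(Seq(pca.pc), 1).saveAsObjectFile("/tmp/pca-pc")

// Later, read the matrix back and use it to project new vectors.
import org.apache.spark.mllib.linalg.DenseMatrix
val restoredPc = sc.objectFile[DenseMatrix]("/tmp/pca-pc").first()

And a sketch of the DataFrame-based (spark.ml) alternative, assuming a SparkSession named spark; here the fitted PCAModel supports save and load directly (paths again illustrative):

import org.apache.spark.ml.feature.{PCA, PCAModel}
import org.apache.spark.ml.linalg.Vectors

// The same toy data as a DataFrame with "label" and "features" columns.
val df = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(1, 0, 0, 0, 1)),
  (1.0, Vectors.dense(1, 1, 0, 1, 0)),
  (1.0, Vectors.dense(1, 1, 0, 0, 0)),
  (0.0, Vectors.dense(1, 0, 0, 0, 0)),
  (1.0, Vectors.dense(1, 1, 0, 0, 0))
)).toDF("label", "features")

// Fit the DataFrame-based PCA estimator.
val pcaModel = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(5)
  .fit(df)

// The fitted PCAModel can be written to disk and loaded again later.
pcaModel.write.overwrite().save("/tmp/spark-pca-model")
val restored = PCAModel.load("/tmp/spark-pca-model")

// Apply the restored model to (new) data.
val projected = restored.transform(df).select("label", "pcaFeatures")

With the DataFrame-based API, the fitted model can also be placed into a Pipeline and persisted together with the other stages.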