Tags: apache-spark, statistics, pca, apache-spark-mllib

Spark PCA top components


In the Spark MLlib documentation for Dimensionality Reduction there is a section about PCA that describes how to use PCA in Spark. The computePrincipalComponents method requires a parameter that determines the number of top components we want.

The problem is that I don't know how many components I want. I mean as few as possible. In some other tools, PCA gives us a table showing, for example, that if we choose 3 components we'll cover 95 percent of the variance in the data. Does Spark have this functionality in its libraries, and if it doesn't, how can I implement it in Spark?


Solution

  • Spark 2.0+:

    This should be available out of the box. See SPARK-11530 for details.
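
    For instance, the spark.ml PCA model exposes an explainedVariance vector. A minimal sketch, assuming a DataFrame df with an assembled vector column named "features" (the column names, the k upper bound, and the 0.95 target are placeholders):

    import org.apache.spark.ml.feature.PCA

    // Fit PCA with an upper bound on the number of components to inspect
    val model = new PCA()
      .setInputCol("features")      // assumed input column name
      .setOutputCol("pcaFeatures")
      .setK(10)                     // upper bound on components to consider
      .fit(df)

    // explainedVariance holds the proportion of variance per component;
    // accumulate it and find the first index at which the target is reached
    val cumulative = model.explainedVariance.toArray.scanLeft(0.0)(_ + _).tail
    val needed = cumulative.indexWhere(_ >= 0.95) + 1  // 0 if the target isn't reached with k = 10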

  • Spark <= 1.6:

    Spark doesn't provide this functionality yet, but it is not hard to implement using existing Spark code and the definition of explained variance. Let's say we want to explain 75 percent of the total variance:

    val targetVar = 0.75
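
    For reference, the quantity computed in the steps below is the cumulative explained variance ratio. With the eigenvalues \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n of the covariance matrix, the fraction of variance explained by the top k components is

    \text{explained}(k) = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}

    Since the covariance matrix is symmetric and positive semi-definite, the singular values returned by the SVD below are exactly these eigenvalues.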
    

    First, let's reuse Spark code to compute the SVD of the covariance matrix:

    import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV, svd => brzSvd}
    import breeze.linalg.accumulate
    import java.util.Arrays
    import org.apache.spark.mllib.linalg.Matrices

    // `mat` is assumed to be an existing RowMatrix holding the input data

    // Compute covariance matrix
    val cov = mat.computeCovariance()

    // Compute SVD of the covariance matrix; `e` holds its eigenvalues in
    // decreasing order and the columns of `u` are the corresponding
    // eigenvectors (principal components)
    val brzSvd.SVD(u: BDM[Double], e: BDV[Double], _) = brzSvd(
      new BDM(cov.numRows, cov.numCols, cov.toArray))
    

    Next we can find the cumulative fraction of explained variance:

    val varExplained = accumulate(e).map(x => x / e.toArray.sum).toArray
    

    and the number of components we have to keep:

    // `k` is the zero-based index of the first component at which the
    // cumulative explained variance reaches the target, so we keep k + 1 components
    val (v, k) = varExplained.zipWithIndex.filter{
        case (v, _) => v >= targetVar
    }.head
    

    Finally we can subset U, once again reusing Spark code:

    val n = mat.numCols.toInt

    // Keep the first k + 1 columns of U (the top principal components)
    val topComponents = Matrices.dense(n, k + 1, Arrays.copyOfRange(u.data, 0, n * (k + 1)))
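
    As a possible follow-up (a sketch, not part of the original steps): the selected components can be used to project the input rows via RowMatrix.multiply, the same way the output of computePrincipalComponents is normally used; topComponents is the k + 1 column subset built above.

    // Project the data onto the top k + 1 principal components;
    // the result has the same number of rows as `mat` and k + 1 columns
    val projected = mat.multiply(topComponents)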