Tags: apache-spark, statistics, pca, apache-spark-mllib

Spark PCA top components


In the Spark MLlib documentation for Dimensionality Reduction there is a section about PCA that describes how to use PCA in Spark. The computePrincipalComponents method requires a parameter that determines the number of top components we want.

The problem is that I don't know how many components I want. I mean as few as possible. In some other tools, PCA gives us a table showing, for example, that if we choose 3 components we'll cover 95 percent of the variance in the data. Does Spark have this functionality in its libraries, and if it doesn't, how can I implement it in Spark?


Solution

  • Spark 2.0+:

    This should be available out of the box. See SPARK-11530 for details.
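
    For instance, the spark.ml PCA model exposes an explainedVariance vector. A minimal sketch, assuming a DataFrame df with an assembled vector column named "features" (the column names, the k upper bound, and the 0.95 target are placeholders):

    import org.apache.spark.ml.feature.PCA

    // Fit PCA with an upper bound on the number of components to inspect
    val model = new PCA()
      .setInputCol("features")      // assumed input column name
      .setOutputCol("pcaFeatures")
      .setK(10)                     // upper bound on components to consider
      .fit(df)

    // explainedVariance holds the proportion of variance per component;
    // accumulate it and find the first index at which the target is reached
    val cumulative = model.explainedVariance.toArray.scanLeft(0.0)(_ + _).tail
    val needed = cumulative.indexWhere(_ >= 0.95) + 1  // 0 if the target isn't reached with k = 10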

  • Spark <= 1.6:

    Spark doesn't provide this functionality yet, but it is not hard to implement using existing Spark code and the definition of explained variance. Let's say we want to explain 75 percent of the total variance:

    val targetVar = 0.75
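
    For reference, the quantity computed in the steps below is the cumulative explained variance ratio. With the eigenvalues \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n of the covariance matrix, the fraction of variance explained by the top k components is

    \text{explained}(k) = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i}

    Since the covariance matrix is symmetric and positive semi-definite, the singular values returned by the SVD below are exactly these eigenvalues.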
    

    First, let's reuse Spark code to compute the SVD of the covariance matrix:

    import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV, svd => brzSvd}
    import breeze.linalg.accumulate
    import java.util.Arrays
    import org.apache.spark.mllib.linalg.Matrices

    // `mat` is assumed to be an existing RowMatrix holding the input data

    // Compute covariance matrix
    val cov = mat.computeCovariance()

    // Compute SVD of the covariance matrix; `e` holds its eigenvalues in
    // decreasing order and the columns of `u` are the corresponding
    // eigenvectors (principal components)
    val brzSvd.SVD(u: BDM[Double], e: BDV[Double], _) = brzSvd(
      new BDM(cov.numRows, cov.numCols, cov.toArray))
    

    Next we can find the cumulative fraction of explained variance:

    val varExplained = accumulate(e).map(x => x / e.toArray.sum).toArray
    

    and the number of components we have to keep:

    // `k` is the zero-based index of the first component at which the
    // cumulative explained variance reaches the target, so we keep k + 1 components
    val (v, k) = varExplained.zipWithIndex.filter{
        case (v, _) => v >= targetVar
    }.head
    

    Finally we can subset U, once again reusing Spark code:

    val n = mat.numCols.toInt

    // Keep the first k + 1 columns of U (the top principal components)
    val topComponents = Matrices.dense(n, k + 1, Arrays.copyOfRange(u.data, 0, n * (k + 1)))
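
    As a possible follow-up (a sketch, not part of the original steps): the selected components can be used to project the input rows via RowMatrix.multiply, the same way the output of computePrincipalComponents is normally used; topComponents is the k + 1 column subset built above.

    // Project the data onto the top k + 1 principal components;
    // the result has the same number of rows as `mat` and k + 1 columns
    val projected = mat.multiply(topComponents)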