Tags: machine-learning, apache-spark-mllib, pca, svd, bigdata

How are PCA and SVD distributed in libraries like MLlib or Mahout


I know techniques for dimensionality reduction like PCA or SVD.

I would like to know how these techniques are implemented in distributed Big Data platforms like Apache Spark.

Is pseudocode or a schematic of the formulation available? I would like to know which parts of these algorithms could become a bottleneck due to communication costs.

Thank you very much in advance


Solution

  • Apache Mahout implements Distributed Stochastic Singular Value Decomposition (ds-svd), which is based directly on *Randomized methods for computing low-rank approximations of matrices* by Nathan Halko.

    Note that dssvd is part of Apache Mahout Samsara, a library that runs on top of Spark. So in essence this is a Spark-based approach to SVD that is in fact distributed. (For a feel of what a distributed SVD call looks like in Spark itself, see the first sketch after this answer.)

    With regard to distributed PCA, Mahout also exposes distributed stochastic PCA. There has been some website shuffling recently, but dspca (distributed stochastic principal component analysis) is given as an example here, which covers both the algorithm and the implementation (the second sketch below shows the analogous Spark MLlib call).

    Halko, I believe (see the reference above), also discusses distributed PCA. I can't tell you where the bottlenecks would be, but I hope this information gets you started in your research.
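
    As a concrete starting point, here is a minimal sketch of a distributed SVD using Spark MLlib's `RowMatrix.computeSVD` (this is MLlib's built-in routine, not Mahout's dssvd; the object name and the tiny input matrix are made up for illustration):

    ```scala
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.sql.SparkSession

    object DistributedSvdSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("svd-sketch")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Rows of the matrix are spread across executors; the full
        // matrix is never materialized on a single machine.
        val rows = sc.parallelize(Seq(
          Vectors.dense(1.0, 2.0, 3.0),
          Vectors.dense(4.0, 5.0, 6.0),
          Vectors.dense(7.0, 8.0, 9.0),
          Vectors.dense(10.0, 11.0, 12.0)
        ))
        val mat = new RowMatrix(rows)

        // Top-k singular values/vectors. For a modest number of columns,
        // MLlib forms the Gram matrix A^T A with a distributed
        // treeAggregate (the main communication step) and then
        // eigendecomposes that small matrix on the driver.
        val svd = mat.computeSVD(2, computeU = true)

        println(s"Singular values: ${svd.s}")
        println(s"V:\n${svd.V}")
        svd.U.rows.take(2).foreach(println) // U stays distributed (a RowMatrix)

        spark.stop()
      }
    }
    ```

    The shape to notice is that U stays distributed while the small factors (s and V) come back to the driver; that is the general structure of most distributed SVD and PCA implementations.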
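
    For PCA, the analogous MLlib call is `computePrincipalComponents`; again this is only a sketch with made-up data and a hypothetical helper name, not Mahout's dspca:

    ```scala
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Hypothetical helper; assumes an already-running SparkContext `sc`.
    def pcaSketch(sc: SparkContext): Unit = {
      val rows = sc.parallelize(Seq(
        Vectors.dense(2.5, 2.4),
        Vectors.dense(0.5, 0.7),
        Vectors.dense(2.2, 2.9),
        Vectors.dense(1.9, 2.2)
      ))
      val mat = new RowMatrix(rows)

      // MLlib assembles the covariance matrix with one distributed
      // aggregation, then eigendecomposes that small matrix on the driver.
      val pc = mat.computePrincipalComponents(1) // n x 1 local matrix

      // Projection is a map-side multiply by a broadcast local matrix,
      // so it needs no shuffle.
      val projected = mat.multiply(pc)
      projected.rows.collect().foreach(println)
    }
    ```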