pyspark, pca, apache-spark-ml

How to get explained variance per PCA component in pyspark


As far as I know, PySpark offers a PCA API like this:

from pyspark.ml.feature import PCA
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(data_frame) 

However, in practice I find the explained variance ratio is more widely used. For example, in sklearn you can pass a fraction to n_components, and it keeps just enough components to reach that share of the variance:

from sklearn.decomposition import PCA
pca_fitter = PCA(n_components=0.85)  # keep enough components to explain 85% of the variance

Does anyone know how to implement the explained variance ratio in PySpark? Thanks!


Solution

  • From Spark 2.0 onwards, PCAModel includes an explainedVariance method; from the docs:

    explainedVariance

    Returns a vector of proportions of variance explained by each principal component.

    New in version 2.0.0.

    Here is an example with k=2 principal components and toy data, adapted from the documentation:

    spark.version
    # u'2.2.0'
    
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.feature import PCA
    
    data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
            (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
            (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
    
    df = spark.createDataFrame(data, ["features"])
    pca = PCA(k=2, inputCol="features", outputCol="pca_features")
    model = pca.fit(df)
    
    model.explainedVariance
    # DenseVector([0.7944, 0.2056])
    

    i.e. from our k=2 principal components, the first one explains 79.44% of the variance, while the second one explains the remaining 20.56%.
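
    If you want to reproduce sklearn's threshold behavior (n_components=0.85), there is no direct equivalent in the PySpark API, but you can get the same effect by first fitting with the maximum k and then keeping the smallest k whose cumulative explained variance reaches the threshold. Here is a minimal sketch along those lines, reusing df from the example above; variance_threshold is just an illustrative name, and the sketch assumes the threshold is attainable with the available components:

    import numpy as np
    from pyspark.ml.feature import PCA

    variance_threshold = 0.85  # illustrative; mirrors sklearn's n_components=0.85

    # fit once with the maximum k (here 5, the number of input features)
    full_model = PCA(k=5, inputCol="features", outputCol="pca_features").fit(df)

    # cumulative proportion of variance explained, component by component
    cum_var = np.cumsum(full_model.explainedVariance.toArray())

    # smallest k reaching the threshold (assumes it is reachable with k <= 5)
    k = int(np.argmax(cum_var >= variance_threshold)) + 1

    # refit with the selected k and transform the data
    model = PCA(k=k, inputCol="features", outputCol="pca_features").fit(df)
    result = model.transform(df).select("pca_features")

    Fitting twice is a bit wasteful, but PCAModel does not expose a way to truncate an already-fitted model to fewer components, so the refit keeps the output column honest about its dimensionality.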