apache-spark pyspark pca

How to determine the optimum number of features in PCA with PySpark


With scikit-learn, we can decide how many features to keep based on the cumulative explained variance plot, as below:

from sklearn.decomposition import PCA

pca = PCA()  # init PCA, keeping all components
pca.fit(dataset)  # fit the PCA model to the dataset

pca.explained_variance_ratio_  # fraction of the variance explained by each of the seven individual components

We can plot the cumulative value as below:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))  # size of the chart, in inches
cumulativeValue = pca.explained_variance_ratio_.cumsum()  # cumulative sum of the explained variance ratios

plt.plot(range(1, 8), cumulativeValue, marker='o', linestyle="--")  # x-axis: components 1 to 7

The number of components at which the cumulative explained variance reaches about 80% is then the best number of features we could choose for PCA.

My question is: how do I determine the best number of features with PySpark?


Solution

  • We can determine this with the help of explainedVariance. Here is how I did it:

    import matplotlib.pyplot as plt
    from pyspark.ml.feature import PCA, VectorAssembler
    
    # use a VectorAssembler to build the input vector column
    inputCols = ['inputCol1', 'inputCol2', 'inputCol3', 'inputCol4']  # placeholders; list all of your feature columns
    vectorAssembler = VectorAssembler(inputCols=inputCols, outputCol='pcaInput')
    
    df = vectorAssembler.transform(dataset)  # add the assembled vector column to the data
    k = len(inputCols)  # keep every component; k cannot exceed the number of input features (8 in my data)
    pca = PCA(k=k, inputCol="pcaInput", outputCol="features")
    pcaModel = pca.fit(df)  # fit the PCA model to the data
    print(pcaModel.explainedVariance)  # proportion of variance explained by each component
    cumValues = pcaModel.explainedVariance.toArray().cumsum()  # cumulative explained variance
    # plot the graph
    plt.figure(figsize=(10, 8))
    plt.plot(range(1, k + 1), cumValues, marker='o', linestyle='--')
    plt.title('variance by components')
    plt.xlabel('num of components')
    plt.ylabel('cumulative explained variance')
    plt.show()
    

    Choose the number of components at which the cumulative explained variance is near 80%.

    So in this case the optimum number of components is 2.
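
Reading the cutoff off the plot can also be done programmatically. The following is a minimal sketch, not part of the original answer: the helper name `components_for_threshold` and the example ratios are hypothetical, and the 0.8 default simply mirrors the "near 80%" rule of thumb used above.

```python
from itertools import accumulate


def components_for_threshold(ratios, threshold=0.8):
    """Return the smallest number of leading components whose cumulative
    explained variance ratio reaches `threshold`."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return k
    # Can happen when PCA was fitted with k smaller than the number of features.
    raise ValueError("threshold is not reached; refit PCA with a larger k")


# Made-up ratios for illustration only, not results from the original post.
example_ratios = [0.55, 0.27, 0.10, 0.05, 0.03]
print(components_for_threshold(example_ratios))
```

With a fitted PySpark model, the ratios could be passed as `pcaModel.explainedVariance.toArray()`. The PCA can then be refitted with the returned value as `k` to get the reduced feature vectors.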