How to determine the optimum number of features in PCA with PySpark

With scikit-learn, we can decide how many features (principal components) to keep based on the cumulative explained variance plot, as below:

from sklearn.decomposition import PCA

pca = PCA()  # initialise PCA, keeping all components
pca.fit(dataset)  # fit the PCA model to the dataset

pca.explained_variance_ratio_  # fraction of the total variance explained by each of the seven individual components

We can plot the cumulative values as below:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))  # size of the chart in inches
cumulativeValue = pca.explained_variance_ratio_.cumsum()  # cumulative sum of the explained variance ratios

plt.plot(range(1, 8), cumulativeValue, marker='o', linestyle='--')

The number of components at which the cumulative explained variance reaches roughly 80% is then a reasonable number of features to keep for PCA.
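The 80% rule can also be written down explicitly. This is only a sketch: the symbols are introduced here for illustration and do not come from the post, and "near 80%" is read as "at least 80%", which is an assumption. Let $\lambda_1 \ge \dots \ge \lambda_p$ be the variances of the $p$ principal components, $r_i$ the explained variance ratio of component $i$, and $C(k)$ the cumulative value plotted above.

```latex
r_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}, \qquad
C(k) = \sum_{i=1}^{k} r_i, \qquad
k^{*} = \min \{\, k : C(k) \ge 0.8 \,\}
```

In the code above, $r_i$ corresponds to pca.explained_variance_ratio_[i-1] and $C(k)$ to cumulativeValue[k-1].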

My question is: how do I determine the best number of features with PySpark?

Solution

We can determine this with the help of explainedVariance. Here is how I did it:

import matplotlib.pyplot as plt
from pyspark.ml.feature import PCA, VectorAssembler

# use VectorAssembler to combine the input columns into a single vector column
vectorAssembler = VectorAssembler(inputCols=['inputCol1', 'inputCol2', 'inputCol3', 'inputCol4'], outputCol='pcaInput')

df = vectorAssembler.transform(dataset)  # apply the assembler to the dataset
# k is set to the maximum number of features that I have; k must not exceed the number of input columns
pca = PCA(k=8, inputCol="pcaInput", outputCol="features")
pcaModel = pca.fit(df)  # fit the PCA model to the data
print(pcaModel.explainedVariance)  # proportion of variance explained by each component
cumValues = pcaModel.explainedVariance.toArray().cumsum()  # cumulative explained variance
# plot the cumulative explained variance against the number of components
plt.figure(figsize=(10, 8))
plt.plot(range(1, 9), cumValues, marker='o', linestyle='--')
plt.title('variance by components')
plt.xlabel('num of components')
plt.ylabel('cumulative explained variance')
plt.show()

Choose the number of components at which the cumulative explained variance is near 80%.

So in this case, the optimum number of components is 2.
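Instead of reading the value off the plot, the same choice can be made programmatically. Below is a minimal sketch, not part of the original answer: the helper name components_for_threshold, the 0.8 default, and the example ratios are all hypothetical, and "near 80%" is read as "the first point where the cumulative value reaches at least 80%".

```python
from itertools import accumulate


def components_for_threshold(explained_variance, threshold=0.8):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `threshold`.

    `explained_variance` is a sequence of per-component ratios, ordered
    from the first principal component onwards.
    """
    for k, cumulative in enumerate(accumulate(explained_variance), start=1):
        if cumulative >= threshold:
            return k
    # The threshold was never reached, so keep every available component.
    return len(explained_variance)


# Hypothetical ratios for eight components (made up for illustration only).
ratios = [0.55, 0.27, 0.08, 0.05, 0.03, 0.01, 0.005, 0.005]
best_k = components_for_threshold(ratios)
print(best_k)  # 2, since 0.55 + 0.27 is the first cumulative value above 0.8
```

With the fitted model from the answer, the input could presumably be obtained as pcaModel.explainedVariance.toArray().tolist(), and the returned value used as k when refitting PCA.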