With scikit-learn we can decide the number of components to keep based on the cumulative explained variance plot, as below:
from sklearn.decomposition import PCA
pca = PCA()  # initialise PCA, keeping all components
pca.fit(dataset)  # fit the dataset to the PCA model
pca.explained_variance_ratio_  # this attribute shows how much variance is explained by each of the seven individual components
We can plot the cumulative value as below:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))  # set the figure size
cumulativeValue = pca.explained_variance_ratio_.cumsum()  # cumulative sum of the explained variance ratios
plt.plot(range(1, 8), cumulativeValue, marker='o', linestyle='--')
The point where the cumulative explained variance reaches roughly 80% is a good number of components to keep for PCA.
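For example, a minimal sketch (reusing cumulativeValue from the snippet above; n_components is just an illustrative name) that picks the smallest number of components crossing the 80% threshold:

import numpy as np
# index of the first cumulative value >= 0.8, plus one to turn the index into a component count
n_components = int(np.argmax(cumulativeValue >= 0.8)) + 1
print(n_components)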
My question is: how do I determine the best number of features with PySpark?
We can determine this with the help of explainedVariance. Here is how I did it:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import PCA
# use VectorAssembler to build the single input vector column expected by PCA
vectorAssembler = VectorAssembler(inputCols=['inputCol1', 'inputCol2', 'inputCol3', 'inputCol4'], outputCol='pcaInput')
df = vectorAssembler.transform(dataset)  # assemble the input columns into the vector column
pca = PCA(k=8, inputCol="pcaInput", outputCol="features")  # k is set to the maximum number of features I have
pcaModel = pca.fit(df)  # fit the data to build the PCA model
print(pcaModel.explainedVariance)  # prints the variance explained by each component
cumValues = pcaModel.explainedVariance.cumsum() # get the cumulative values
# plot the cumulative explained variance
plt.figure(figsize=(10, 8))
plt.plot(range(1, 9), cumValues, marker='o', linestyle='--')
plt.title('Variance by components')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
Choose the number of components where the cumulative explained variance gets close to 80%.
So in this case the optimal number of components is 2.
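If you prefer to pick k programmatically rather than reading it off the plot, a minimal sketch (reusing df and cumValues from above; best_k, pcaFinal, and reducedDf are just illustrative names) would be:

import numpy as np
from pyspark.ml.feature import PCA

best_k = int(np.argmax(cumValues >= 0.8)) + 1  # smallest k whose cumulative explained variance reaches 80%
pcaFinal = PCA(k=best_k, inputCol="pcaInput", outputCol="features")
finalModel = pcaFinal.fit(df)
reducedDf = finalModel.transform(df).select("features")  # dataset reduced to best_k components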