
Principal Component Analysis (PCA) explained variance remains the same after changing dataframe column order


I have a dataframe where A and B are used to predict C:

from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier

df = df[['A','B','C']]
array = df.values

X = array[:,0:-1]
Y = array[:,-1]

# Feature Importance
model = GradientBoostingClassifier()
model.fit(X, Y)
print("Importance:")
print(model.feature_importances_ * 100)


#PCA
pca = PCA(n_components=len(df.columns)-1)
fit = pca.fit(X)

print("Explained Variance")
print(fit.explained_variance_ratio_)

This prints

Importance:
[ 53.37975706  46.62024294]
Explained Variance
[ 0.98358394  0.01641606]

However, when I changed the column order of the dataframe by swapping A and B, only the importance changed; the explained variance stayed the same. Why did the explained variance not change accordingly to [0.01641606 0.98358394]?

df = df[['B','A','C']]

This prints


Importance:
[ 46.40771024  53.59228976]
Explained Variance
[ 0.98358394  0.01641606]

Solution

  • The explained variance does not refer to A, B, or any other column of your dataframe. It refers to the principal components found by PCA, each of which is a linear combination of the columns. These components are sorted in order of decreasing variance, as the documentation says:

    components_ : array, shape (n_components, n_features) Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.

    explained_variance_ : array, shape (n_components,) The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.

    explained_variance_ratio_ : array, shape (n_components,) Percentage of variance explained by each of the selected components.

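    So the order of the features does not affect the order of the components returned, which is why explained_variance_ratio_ is identical in both runs. It does affect the array components_, the matrix that maps between the principal components and the feature space: swapping A and B reorders its columns to match.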
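
A minimal sketch of this behaviour, assuming synthetic data in place of the question's dataframe (the arrays `a` and `b`, their scales, and the random seed are made up for illustration and are not the asker's data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for columns A and B: A has a much larger variance
# than B, and the two are mildly correlated.
rng = np.random.default_rng(0)
a = rng.normal(scale=10.0, size=500)
b = 0.1 * a + rng.normal(scale=1.0, size=500)

X_ab = np.column_stack([a, b])  # column order A, B
X_ba = np.column_stack([b, a])  # column order B, A

pca_ab = PCA(n_components=2).fit(X_ab)
pca_ba = PCA(n_components=2).fit(X_ba)

# The ratios belong to the components, which are sorted by decreasing
# variance, so they should match for both column orders.
same_ratio = np.allclose(pca_ab.explained_variance_ratio_,
                         pca_ba.explained_variance_ratio_)

# The loadings in components_ follow the features: swapping the input
# columns swaps the columns of components_. Absolute values are compared
# because the sign of a principal axis is arbitrary.
same_loadings = np.allclose(np.abs(pca_ab.components_),
                            np.abs(pca_ba.components_[:, ::-1]))

print(same_ratio, same_loadings)
```

To see how much each original feature contributes to a given component, inspect the rows of `components_` rather than `explained_variance_ratio_`.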