I am doing Principal Component Analysis (PCA) and I'd like to find out which features contribute the most to the result.
My intuition is to sum up the absolute values of the individual contributions of the features to the individual components.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1, 4, 1], [-2, -1, 4, 2], [-3, -2, 4, 3],
              [1, 1, 4, 4], [2, 1, 4, 5], [3, 2, 4, 6]])
# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95, whiten=True, svd_solver='full').fit(X)
pca.components_
array([[ 0.71417303,  0.46711713,  0.        ,  0.52130459],
       [-0.46602418, -0.23839061, -0.        ,  0.85205128]])
np.sum(np.abs(pca.components_), axis=0)
array([1.18019721, 0.70550774, 0.        , 1.37335586])
This yields, in my eyes, a measure of importance of each of the original features. Note that the 3rd feature has zero importance, because I intentionally created a column that is just a constant value.
Is there a better "measure of importance" for PCA?
The measure of importance for PCA is in explained_variance_ratio_. This array gives the percentage of variance explained by each component. It is sorted by importance of the components in descending order, and it sums to 1 when all the components are used, or otherwise to the minimal possible value above the requested threshold. In your example you set the threshold to 95% (of the variance to be explained), so the array sums to 0.9949522861608583: the first component explains 92.021143% and the second 7.474085% of the variance, hence the 2 components you receive.
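You can see those numbers directly on the fitted pca object from the question (a minimal check; the printed values are the ones quoted above):

print(pca.explained_variance_ratio_)        # [0.92021143 0.07474085]
print(pca.explained_variance_ratio_.sum())  # 0.9949522861608583
print(pca.n_components_)                    # 2 components kept for the 95% threshold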
components_ is the array that stores the directions of maximum variance in the feature space. Its dimensions are n_components_ by n_features_. This is what you multiply the data point(s) by when applying transform() to get the reduced-dimensionality projection of the data.
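For instance, transform() can be reproduced by hand from components_ (a sketch; note that because whiten=True, each projected coordinate is additionally rescaled by 1/sqrt of explained_variance_):

# Center the data, project onto the components, then apply the whitening scale
Z = (X - pca.mean_) @ pca.components_.T / np.sqrt(pca.explained_variance_)
print(np.allclose(Z, pca.transform(X)))  # True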
In order to get the percentage contribution of the original features to each of the Principal Components, you just need to normalize components_, as its entries set the amount each original feature contributes to the projection.
r = np.abs(pca.components_.T)
r / r.sum(axis=0)  # each column (PC) now sums to 1
array([[0.41946155, 0.29941172],
       [0.27435603, 0.15316146],
       [0.        , 0.        ],
       [0.30618242, 0.54742682]])
As you can see, the third feature does not contribute to the PCs.
If you need the total contribution of the original features to the explained variance, you need to take into account each PC's contribution (i.e. explained_variance_ratio_):
# Weight each feature's absolute loadings by the variance explained by each PC
ev = np.abs(pca.components_.T).dot(pca.explained_variance_ratio_)
# Rescale so the importances sum to the total explained variance ratio
ttl_ev = pca.explained_variance_ratio_.sum()*ev/ev.sum()
print(ttl_ev)
[0.40908847 0.26463667 0.         0.32122715]
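If it helps readability, you can pair these importances with column names and sort them (a sketch; the feature names are hypothetical, since X has no labels):

names = ['f0', 'f1', 'f2', 'f3']  # hypothetical names for the four columns of X
for name, imp in sorted(zip(names, ttl_ev), key=lambda t: -t[1]):
    print(f'{name}: {imp:.4f}')
# f0: 0.4091
# f3: 0.3212
# f1: 0.2646
# f2: 0.0000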