I'm doing a bit of machine learning and trying to find important dimensions using PCA. Here's what I've done so far:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.98)
X_reduced = pca.fit_transform(df_normalized)
X_reduced.shape
(2208, 1961)
So after running PCA set to retain 98% of the variance, I have 2,208 rows and 1,961 columns. However, I'm worried that the dimensions with the least explanatory power may actually be hurting my attempt at prediction (my model may just find spurious correlations in the data).
Does SciKit-Learn order the columns by importance? If so, I could just do:
X_final = X_reduced[:, :20]
, correct?
Thanks for the help!
The documentation says the output is sorted by explained variance, so yes, you should be able to do what you suggest and just take the first N dimensions of the output. You could also print the fitted PCA object's explained_variance_
(or, better, explained_variance_ratio_
) attribute along with its components_
attribute to double-check the order.
This example from the documentation shows how to access the explained variance amounts:
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
# the ratios are reported largest-first, i.e. the components are already sorted
print(pca.explained_variance_ratio_)
so in your case you could do print(pca.components_)
and print(pca.explained_variance_ratio_)
to get both (note these are attributes of the fitted pca object, not of the transformed array). Then simply take the first N columns of X_reduced
after finding what N explains y% of your variance, for example as sketched below.
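A minimal sketch of that last step, assuming the pca and X_reduced variables from your snippet (the 0.90 threshold is just an illustrative choice for y%):
import numpy as np

# cumulative share of variance explained by the first k components
cum_ratio = np.cumsum(pca.explained_variance_ratio_)

# smallest N whose first N components explain at least 90% of the variance
N = int(np.searchsorted(cum_ratio, 0.90)) + 1

# keep only the first N columns of the transformed data
X_final = X_reduced[:, :N]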
Be aware of the shapes, though! pca.components_
is of the shape [n_components, n_features]
, so if what you want is the first 20 component vectors you should use pca.components_[:20, :]
, I believe. The transformed data X_reduced, on the other hand, has shape [n_samples, n_components], so there your X_reduced[:, :20] slice is the right one.
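If it helps, here is a self-contained toy example (made-up sizes, nothing to do with your data) showing the two shapes side by side:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 30)                     # toy data: 100 samples, 30 features
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (100, 10): samples x components -> slice columns
print(pca.components_.shape)              # (10, 30): components x features -> slice rows

scores_subset = X_reduced[:, :5]          # first 5 transformed columns (what the question asks for)
axes_subset = pca.components_[:5, :]      # first 5 principal axes (rows of components_)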