I am trying to reduce the feature dimensions using PCA. I have been able to apply PCA to my training data, but am struggling to understand why the reduced feature set (X_train_pca) shares no similarities with the original features (X_train).
import numpy as np
from sklearn.decomposition import PCA

print(X_train.shape)  # (26215, 727)
pca = PCA(0.5)  # keep enough components to explain 50% of the variance
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
print(X_train_pca.shape)  # (26215, 100)
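(As an aside, passing a float between 0 and 1 to PCA sets an explained-variance target: scikit-learn keeps the smallest number of components whose cumulative explained variance reaches that fraction, which here came out to 100. A quick check, assuming the fitted pca above:)

print(pca.explained_variance_ratio_.sum() >= 0.5)  # True
print(pca.n_components_)  # 100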
most_important_feature_indices = [np.abs(pca.components_[i]).argmax() for i in range(pca.n_components_)]
most_important_feature_index = most_important_feature_indices[0]
Should the first feature vector in X_train_pca not just be a subset of the first feature vector in X_train? For example, why doesn't the following equal True?
print(X_train[0][most_important_feature_index] == X_train_pca[0][0]) # False
Furthermore, none of the values from the first feature vector of X_train appear in the first feature vector of X_train_pca:
for i in X_train[0]:
print(i in X_train_pca[0])
# False
# False
# False
# ...
PCA transforms your high-dimensional feature vectors into low-dimensional ones by projecting them onto new axes (the principal components). It does not pick out the least important indices in the original space and drop those dimensions; each transformed feature is a linear combination of all the original features, so no original value survives unchanged in the output. The sketch below makes this concrete.
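A minimal sketch, assuming the fitted pca object from the question (and the default whiten=False): each entry of X_train_pca is the dot product of a mean-centered row of X_train with one principal axis, i.e. a weighted mix of all 727 original features.

import numpy as np

# Reproduce scikit-learn's transform by hand: center the data, then
# project it onto the principal axes stored row-wise in pca.components_.
X_centered = X_train - pca.mean_
X_manual = X_centered @ pca.components_.T

# Matches the library output elementwise, confirming the transform is a
# projection rather than a selection of original columns.
print(np.allclose(X_manual, X_train_pca))  # True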