python · machine-learning · pca · feature-selection

PCA features do not match original features


I am trying to reduce the feature dimensions using PCA. I have been able to apply PCA to my training data, but am struggling to understand why the reduced feature set (X_train_pca) shares no similarities with the original features (X_train).

import numpy as np
from sklearn.decomposition import PCA

print(X_train.shape)  # (26215, 727)
pca = PCA(0.5)        # keep enough components to explain 50% of the variance
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
print(X_train_pca.shape)  # (26215, 100)

most_important_features_indices = [np.abs(pca.components_[i]).argmax() for i in range(pca.n_components_)]
most_important_feature_index = most_important_features_indices[0]

Should the first feature vector in X_train_pca not be just a subset of the first feature vector in X_train? For example, why doesn't the following equal True?

print(X_train[0][most_important_feature_index] == X_train_pca[0][0]) # False

Furthermore, none of the features from the first feature vector of X_train are in the first feature vector of X_train_pca:

for i in X_train[0]:
    print(i in X_train_pca[0])
# False
# False
# False
# ...

Solution

  • PCA transforms your high-dimensional feature vectors into low-dimensional feature vectors; it does not determine the least important columns in the original space and drop them. Each transformed feature is a projection of the (centered) sample onto a principal component, i.e. a weighted sum of all 727 original features, so none of the transformed values will equal any original value. np.abs(pca.components_[i]).argmax() only tells you which original feature has the largest weight in component i, not which column was "kept"; see the sketch below.
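
A minimal sketch of this, using made-up toy data in place of the real X_train (shapes and values are illustrative only): scikit-learn's transform() is a centered matrix product with the components, not a column selection, so the transformed values never coincide with original feature values.

import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the real X_train (which is (26215, 727) in the question).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 30))

# PCA(0.5) keeps however many components are needed to explain 50% of the variance.
pca = PCA(0.5)
X_train_pca = pca.fit_transform(X_train)

# transform() centers the data and projects it onto the components:
# every output value is a weighted sum of ALL original features.
manual = (X_train - pca.mean_) @ pca.components_.T
print(np.allclose(X_train_pca, manual))      # True

# The largest-weight feature in component 0 is not simply copied into the output.
idx = np.abs(pca.components_[0]).argmax()
print(X_train[0, idx] == X_train_pca[0, 0])  # False (almost surely)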