Tags: python, matplotlib, machine-learning, scikit-learn, pca

Classify using components from PCA


I ran PCA on my dataset like so:

import pandas as pd
from sklearn.decomposition import PCA

# Reduce the scaled feature matrix to three principal components
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(scale_x)
principalDf = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2', 'PC3'])
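
(As a quick sanity check, the variance captured by these three components can be inspected on the fitted pca object; a minimal sketch, assuming the pca variable above:)

import numpy as np

# Per-component fraction of total variance explained
print(pca.explained_variance_ratio_)
# Cumulative variance retained across PC1, PC2, PC3
print(np.cumsum(pca.explained_variance_ratio_))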

and then, when I visualize the results with matplotlib, I can see a clear division between my two classes:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Color points by class label: red for class 0, green for class 1
ax.scatter(principalDf['PC1'].values, principalDf['PC2'].values, principalDf['PC3'].values,
           c=['red' if m == 0 else 'green' for m in y], marker='o')

ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')

plt.show()

[Figure: 3D scatter plot of PC1, PC2, PC3 showing the two classes separated]

but when I then train a classification model such as an SVM or logistic regression on these components, it is unable to learn this relationship:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lg = LogisticRegression(solver='lbfgs')
lg.fit(principalDf.values, y)
lg_p = lg.predict(principalDf.values)
print(classification_report(y, lg_p, target_names=['Failure', 'Success']))
                 precision    recall  f1-score   support

        Failure       1.00      0.03      0.06        67
        Success       0.77      1.00      0.87       219

       accuracy                           0.77       286
      macro avg       0.89      0.51      0.46       286
   weighted avg       0.82      0.77      0.68       286

What could be the reason for this?


Solution

  • First, make sure the classifier is trained on exactly the three features PC1, PC2, and PC3. Additional components (e.g. PC4 ~ PC6) that are not shown in the plot would affect the classification result if they were included (see the first sketch after this list).

  • Second, a classifier is sometimes not trained as well as you might expect. I recommend trying a decision tree instead of the classifiers you used: a tree makes axis-aligned (piecewise-linear) splits, so it is likely to reproduce the separation you see in the plot (see the second sketch after this list).
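
For the first point, a minimal sketch of how to check that the classifier sees exactly the three plotted components (assuming the principalDf and y from the question):

from sklearn.linear_model import LogisticRegression

# Select exactly the three plotted components, nothing more
X = principalDf[['PC1', 'PC2', 'PC3']].values
print(X.shape)  # expect (n_samples, 3)

lg = LogisticRegression(solver='lbfgs')
lg.fit(X, y)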
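
For the second point, a minimal sketch using scikit-learn's DecisionTreeClassifier on the same components (max_depth=5 and random_state=0 are illustrative assumptions, not tuned values):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Axis-aligned tree splits can carve out the region visible in the 3D plot
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(principalDf.values, y)
tree_p = tree.predict(principalDf.values)
print(classification_report(y, tree_p, target_names=['Failure', 'Success']))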