I ran PCA on my dataset like so:
import pandas as pd
from sklearn.decomposition import PCA

# Reduce the scaled feature matrix to three principal components
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(scale_x)
principalDf = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2', 'PC3'])
and then, when I visualize the results with Matplotlib, I can see a division between my two classes:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Colour points by class label: red for class 0, green for class 1
ax.scatter(principalDf['PC1'].values, principalDf['PC2'].values, principalDf['PC3'].values,
           c=['red' if m == 0 else 'green' for m in y], marker='o')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()
but when I then train a classification model such as SVM or logistic regression on these components, it is unable to learn this separation:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lg = LogisticRegression(solver='lbfgs')
lg.fit(principalDf.values, y)
lg_p = lg.predict(principalDf.values)
print(classification_report(y, lg_p, target_names=['Failure', 'Success']))
              precision    recall  f1-score   support

     Failure       1.00      0.03      0.06        67
     Success       0.77      1.00      0.87       219

    accuracy                           0.77       286
   macro avg       0.89      0.51      0.46       286
weighted avg       0.82      0.77      0.68       286
What could be the reason for this?
First, make sure the classifier is trained on exactly the three features you plotted: PC1, PC2 and PC3. Any additional features (e.g. PC4 ~ PC6) that are not shown in the graph may affect the classification result.
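As a minimal sketch of this point (assuming principalDf and y are the objects from your question), keep only the three plotted columns before fitting:

from sklearn.linear_model import LogisticRegression

# Keep only the components that are visualised, so no extra columns
# influence the model (principalDf and y come from the question)
three_pcs = principalDf[['PC1', 'PC2', 'PC3']]
lg = LogisticRegression(solver='lbfgs')
lg.fit(three_pcs.values, y)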
Second, a classifier is sometimes not trained as well as you expect. I recommend trying a decision tree instead of the classifiers you used, because a tree splits on axis-aligned (horizontal/vertical) thresholds and may capture the separation you see in the plot.
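A hedged sketch of that suggestion, reusing principalDf and y from the question (max_depth=5 is an arbitrary choice, not a tuned value):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Fit a decision tree on the same three principal components and
# report precision/recall per class, as in the question
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(principalDf.values, y)
tree_p = tree.predict(principalDf.values)
print(classification_report(y, tree_p, target_names=['Failure', 'Success']))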