performance machine-learning confusion-matrix false-positive

Differences in FP and FN rates between two algorithems

I am conducting binary classification using logistic regression with and without applying PCA. The application of PCA before logistic regression gives a higher accuracy and lower FNs in comparison to logistic regression alone. I would like to find out why this is happening, specifically why PCA produces less FNs. I have read that cost sensitivity analysis could help explain this, but I am not sure if this is correct. Any suggestions?

Solution

There is no need of fancy analysis to explain this behavior.

PCA is used just for "clean" the data by limiting its variance. Let me explain this concept with an example, and then I will turn back to your question.

In general, in any ML problem, the available samples are never sufficient in number to cover all the possible variety of the sample space. You can never have a dataset with all the possible human faces, with all the possible expressions, etc.

So, instead of using all the available features you engineer the features (the pixels, in this example) in a way that you get more meaningful higher level features. You can reduce the resolution of the pictures, as easy example; you will loose the informations on the pictures background, but your model will focus better on the most important part of the picture, i.e. the faces.

When you deal with tabular data, a technique similar to the resolution lowering is cutting off parts of the original features, and that's what PCA do: it keeps the most important components of the features, the "Principal Components", dropping the less important ones.

So, the model trained with PCA gives better results because, by cutting off part of the features, your model focus better on the most important part of your samples, and so it gains robustness against overfitting.

cheers