How to use principal component analysis for logistic regression


I'm interested in using logistic regression to classify opera singing (n=100 audio files) from non-opera singing (n=300 audio files) (just an example). I have multiple features I can use (e.g. MFCCs, pitch, signal energy). I would like to use PCA to reduce dimensionality, which will drop the 'least important variables'. My question is: should I run PCA on my whole dataset (both opera and non-opera)? If I do, wouldn't it drop the 'least important variables' for both opera and non-opera, rather than the variables least important for identifying opera?


Solution

  • Short answer:

    You must fit your PCA on the whole dataset (both classes together).

    Not so short answer:

    1. First, combine the samples from both classes.
    2. Split your data into train and test sets (both sets must contain data from both classes).
    3. Use your train data to fit your PCA model.
    4. Apply the PCA transformation you fit in (3) to both the train and test sets.
    5. Train and test your logistic regression model on the projected datasets (a sketch follows this list).
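
    A minimal sketch of these steps with scikit-learn (the feature matrix X and labels y below are hypothetical stand-ins for whatever features you extract: MFCCs, pitch, signal energy, ...):

        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import StandardScaler
        from sklearn.decomposition import PCA
        from sklearn.linear_model import LogisticRegression

        # Hypothetical data: 400 samples (100 opera + 300 non-opera), 40 features.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(400, 40))       # stand-in for your real feature matrix
        y = np.array([1] * 100 + [0] * 300)  # 1 = opera, 0 = non-opera

        # Step 2: split into train and test, stratified so both sets contain both classes.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, stratify=y, random_state=0)

        # Step 3: fit the scaler and the PCA on the training data only.
        scaler = StandardScaler().fit(X_train)
        pca = PCA(n_components=10).fit(scaler.transform(X_train))

        # Step 4: apply the same transformation to both train and test sets.
        X_train_p = pca.transform(scaler.transform(X_train))
        X_test_p = pca.transform(scaler.transform(X_test))

        # Step 5: train and evaluate the logistic regression on the projected data.
        clf = LogisticRegression(max_iter=1000).fit(X_train_p, y_train)
        print("test accuracy:", clf.score(X_test_p, y_test))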

    Long answer:

    PCA does not remove the 'least important variables'. PCA is a dimensionality reduction algorithm: it finds linear combinations of the input features that encode as much of the information (inertia, i.e. variance) as possible using fewer coordinates.

    So if your data has N_Feats features, you can think of PCA as a matrix of dimension N_Feats x Projection_size, with Projection_size < N_Feats, that you multiply with your data to get a projection of lower dimension.

    This implies that you need all your features (variables) to compute your projection.
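
    To make the shapes concrete, here is a small numpy illustration (all data and names are hypothetical). The fitted components form a Projection_size x N_Feats matrix, so every original feature participates in every projected coordinate:

        import numpy as np
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        X = rng.normal(size=(400, 40))     # 400 samples, N_Feats = 40

        pca = PCA(n_components=10).fit(X)  # Projection_size = 10
        W = pca.components_                # shape (10, 40): one unit-norm row per component

        # Fraction of the total inertia (variance) retained by the 10 components.
        print("retained inertia:", pca.explained_variance_ratio_.sum())

        # Projecting multiplies the centered data by W.T (a 40 x 10 matrix), so
        # all 40 original features are needed to compute the 10 new coordinates.
        X_proj = (X - pca.mean_) @ W.T     # shape (400, 10)
        assert np.allclose(X_proj, pca.transform(X))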

    If you think in terms of projections, it doesn't make sense to have a different projection for each class. Why? There are 2 reasons:

    1. If you have two PCAs, one per class, then when you want to test your model you will not know which PCA to apply to each data sample unless you peek at the test labels. This is an unrealistic situation, because if you already know the labels you don't need a classifier at all. So if you do this, you will get high performance because you are indirectly introducing the label at the input of your classifier.
    2. If you have two PCAs, the coordinates of the projected samples will not have the same meaning across classes (sketched below). It would be like training a classifier on two completely different sources of data that happen to have the same dimension, for instance training a logistic regression to distinguish mice from elephants, where the single feature given for each mouse is its weight and the single feature given for each elephant is its size. The logistic regression model will give you an output, because numerically a solution can always be computed, but it makes no sense methodologically.
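
    As a cautionary sketch of that second point (hypothetical code, illustrating what not to do): fitting one PCA per class yields different axes, so coordinate k of a projected opera sample and coordinate k of a projected non-opera sample measure different directions of the data.

        import numpy as np
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        X_opera = rng.normal(size=(100, 40))
        X_other = rng.normal(size=(300, 40))

        # Anti-pattern: one PCA per class.
        pca_opera = PCA(n_components=10).fit(X_opera)
        pca_other = PCA(n_components=10).fit(X_other)

        # Rows of components_ are unit vectors, so the row-wise dot products are
        # cosine similarities between corresponding axes; values well below 1
        # mean the two projections measure different directions.
        cos = np.abs(np.sum(pca_opera.components_ * pca_other.components_, axis=1))
        print("per-axis |cosine| between the two bases:", cos)

        # Also: choosing pca_opera vs pca_other for a new sample would already
        # require knowing its label, which is exactly what we want to predict.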