python data-analysis pca centering

Mean centering before PCA


I am unsure if this kind of question (related to PCA) is acceptable here or not.

It is generally recommended to mean-center the data before PCA. I have two classes, each made up of different participants, and my aim is to distinguish and classify them. What I am not sure about is whether mean centering should be applied to the whole data set or to each class separately.

Is it better to center each class separately (and if so, should the other preprocessing steps also be applied per class), or does that not make sense?


Solution

  • PCA is just a rotation, optionally followed by a projection onto a lower-dimensional space. It finds the axes of maximal variance (which happen to be the principal axes of inertia of your point cloud) and then rotates the dataset to align those axes with your coordinate system. You decide how many such axes to retain, so the rotation is followed by a projection onto the first k axes of greatest variance, where k is the dimensionality of the representation space you have chosen.

    With this in mind, and just as when calculating axes of inertia, you can look for such axes through the center of mass of your cloud (the mean) or through an arbitrary origin of your choice. In the former case you mean-center your data. In the latter you translate the data to an arbitrary point, which diminishes the importance of the intrinsic shape of the cloud and increases the importance of the distance between the center of mass and that point. In practice, you would therefore almost always center your data.

    You may also want to standardize your data (center and divide by standard deviation so as to make variance 1 on each coordinate), or even whiten your data.

    In any case, you will want to apply the same transformations to the entire dataset, not class by class. If you applied them class by class, the distance between the centers of gravity of the classes would be reduced to 0, and you would likely observe a collapsed representation in which the two classes overlap. This may be interesting if you want to observe the intrinsic shape of each class, but then you would also apply PCA separately to each class.

    Please note that PCA may make it easier for you to visualize the two classes (with no guarantee if the data are truly n-dimensional and have little lower-dimensional structure). But under no circumstances would it make it easier to discriminate between the two. If anything, PCA will reduce how discriminable your classes are, and it is often the case that the projection will intermingle (increase the ambiguity between) classes that are otherwise quite distinct and, e.g., separable with a simple hyper-surface.
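
The effect of centering the whole dataset versus centering each class can be checked with a small NumPy sketch. Everything in it is an assumption made for illustration and not part of the question: the synthetic two-class data, the helper names `pca_project` and `class_mean_gap`, and the choice to compute PCA through an SVD of the centered matrix. It compares the distance between the two class centroids after projection under each centering strategy.

```python
import numpy as np

def pca_project(X, k):
    """Center X on its overall mean, then project onto the top-k principal axes."""
    Xc = X - X.mean(axis=0)  # one mean for the whole dataset
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def class_mean_gap(scores, labels):
    """Distance between the two class centroids in the projected space."""
    return np.linalg.norm(scores[labels == 0].mean(axis=0)
                          - scores[labels == 1].mean(axis=0))

rng = np.random.default_rng(0)
# Hypothetical data: two classes with the same spread but different centers.
class_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(200, 3))
class_b = rng.normal(loc=[6.0, 0.0, 0.0], scale=1.0, size=(200, 3))
labels = np.repeat([0, 1], 200)

# Option 1: center the whole dataset once.
X = np.vstack([class_a, class_b])
scores_global = pca_project(X, k=2)

# Option 2: subtract each class's own mean before PCA.
X_per_class = np.vstack([class_a - class_a.mean(axis=0),
                         class_b - class_b.mean(axis=0)])
scores_per_class = pca_project(X_per_class, k=2)

gap_global = class_mean_gap(scores_global, labels)
gap_per_class = class_mean_gap(scores_per_class, labels)
print(f"centroid gap, global centering:    {gap_global:.3f}")
print(f"centroid gap, per-class centering: {gap_per_class:.3f}")
```

With global centering the offset between the classes survives the projection, whereas per-class centering puts both centroids at the origin, so the gap is zero up to floating-point error. For real work, a library implementation such as scikit-learn's `PCA` would normally replace the hand-written SVD step.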