
Normalization before PCA on different data types


Prior to running principal component analysis you should normalize the data so the results aren't skewed by differences in scale. Under normal circumstances this is a fairly simple task. I am curious how I should go about normalizing my data, which contains multiple data types within the data set. Some columns I know (or strongly believe) are very important. Others I am not so sure about, which is why I wanted to run PCA on my data set in the first place.
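For the purely numeric columns, the usual normalization is z-score standardization (zero mean, unit variance per column). A minimal sketch using only the standard library; in practice you would typically reach for `sklearn.preprocessing.StandardScaler`:

```python
import statistics

def standardize(column):
    """Scale a numeric column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]

# Hypothetical values for column 2 of the sample below.
heights = [68.47, 71.02, 64.88, 69.50]
scaled = standardize(heights)
```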

    0       1       2       3       4    ...
  0.112   'Bob'   68.47   'Right'  9493  ...

Something like this, where a string such as a name has no categorical backing, while 'Right' could be enumerated into a category.
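Enumerating a low-cardinality column like 'Right'/'Left' usually means one-hot encoding it. A minimal sketch (in practice `pandas.get_dummies` or sklearn's `OneHotEncoder` would do this):

```python
def one_hot(values):
    """One-hot encode a categorical column; columns follow sorted category order."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

# Hypothetical handedness column; category order is ['Left', 'Right'].
hands = ['Right', 'Left', 'Right', 'Right']
encoded = one_hot(hands)
```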

I am not sure this is even necessary, but I would appreciate some suggestions.


Solution

  • First, be very careful when running PCA on variables that have no inherent order, such as categorical data.

    Second, think about what it even means to apply PCA to things like names. PCA works on vectors, which have a length and a direction. What is the length of 'Bob', and which direction would it point?

    One thing you can try is converting your string data to character n-grams, which gives you proper vectors. Another thing to try is a TF-IDF transformation, which again yields a vector per string.
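The n-gram plus TF-IDF route can be sketched with only the standard library (sklearn's `TfidfVectorizer(analyzer='char')` is the usual tool; the names here are hypothetical):

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(docs, n=2):
    """Return the shared vocabulary and one TF-IDF weighted vector per document."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    df = {g: sum(1 for c in counts if g in c) for g in vocab}
    idf = {g: math.log(len(docs) / df[g]) for g in vocab}
    return vocab, [[c[g] * idf[g] for g in vocab] for c in counts]

names = ['Bob', 'Bobby', 'Alice']
vocab, vecs = tfidf_vectors(names)
```

Every name now maps to a fixed-length numeric vector, which is something PCA can consume.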

    Once you have applied one of these conversions, you have the problem of vectors embedded within vectors. You can try combining those into one vector by concatenation and normalization. Or you can abandon PCA, treat your data set as a collection of tensors, and apply something like multilinear principal component analysis, which is an extension of PCA to tensors.

    Note that either of these approaches will produce giant vectors, so you need a lot of data instances to get anything meaningful out of your analysis.
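The concatenate-and-normalize route can be sketched like this (the per-field vectors here are hypothetical stand-ins for TF-IDF and one-hot outputs):

```python
import math

def combine(*field_vectors):
    """Concatenate per-field vectors into one record vector, then L2-normalize."""
    flat = [float(x) for vec in field_vectors for x in vec]
    norm = math.sqrt(sum(x * x for x in flat))
    return [x / norm for x in flat] if norm else flat

# One record from the sample row: scalars stay scalar, string fields become
# vectors (made-up TF-IDF weights for the name, one-hot for 'Right').
record = combine([0.112], [0.41, 0.0, 1.1], [68.47], [0.0, 1.0], [9493.0])
```

Normalizing the combined vector keeps large-magnitude fields (like the 9493 here) from dominating the components PCA finds.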