Tags: vector, machine-learning, scikit-learn, svc, supervised-learning

Can a very large (or very small) value in a feature vector bias the results of an SVC? [scikit-learn]


I am trying to better understand how the values of my feature vector may influence the result. For example, let's say I have the following vector with the final value being the result (this is a classification problem using an SVC, for example):

0.713, -0.076, -0.921, 0.498, 2.526, 0.573, -1.117, 1.682, -1.918, 0.251, 0.376, 0.025291666666667, -200, 9, 1

You'll notice that most of the values center around 0; however, one value is orders of magnitude larger in absolute value: -200.

I'm concerned that this value is skewing the prediction and is being weighted unfairly heavily relative to the rest, simply because its magnitude is so different.

Is this something to be concerned about when creating a feature vector? Or will the statistical model I use to evaluate my vector control for this large (or small) value based on the training set I provide it with? Are there methods available in scikit-learn specifically that you would recommend for normalizing the vector?

Thank you for your help!


Solution

  • Yes, it is something you should be concerned about. SVMs are heavily influenced by differences in feature scale, so you need a preprocessing step to reduce that influence. The most popular techniques are:

    1. Linearly rescale each feature dimension to the [0, 1] or [-1, 1] interval
    2. Standardize each feature dimension so it has mean = 0 and variance = 1
    3. Decorrelate the features with the whitening transformation Σ^(-1/2)·X, where Σ = cov(X) is the data covariance matrix

    Each of these can be performed easily with scikit-learn (although for the third you will also need SciPy for the matrix square root and inversion); minimal sketches of all three follow.
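
    For the first two options, a minimal sketch using scikit-learn's MinMaxScaler and StandardScaler (the data matrix here is made up for illustration):

        import numpy as np
        from sklearn.preprocessing import MinMaxScaler, StandardScaler

        # Made-up training matrix for illustration: rows are samples,
        # columns are features; the -200-scale column dwarfs the others
        X = np.array([
            [ 0.713, -0.076, 0.0253, -200.0,  9.0],
            [ 0.498,  2.526, 0.0310, -180.0,  7.0],
            [-1.117,  1.682, 0.0190, -220.0, 11.0],
            [ 0.251,  0.376, 0.0280, -205.0,  8.0],
        ])

        # Option 1: linearly rescale every feature to [0, 1]
        # (pass feature_range=(-1, 1) for the [-1, 1] variant)
        X_minmax = MinMaxScaler().fit_transform(X)

        # Option 2: standardize every feature to mean 0, variance 1
        scaler = StandardScaler()
        X_std = scaler.fit_transform(X)

        # Fit the scaler on training data only, then apply the *same*
        # transform to test data so both live on the same scale:
        # X_test_std = scaler.transform(X_test)

    Fitting on the training set and reusing that fitted scaler at prediction time is what lets the preprocessing "control for" the -200-scale feature consistently across train and test data.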
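    The third option has no single scikit-learn transformer call; below is an illustrative whitening sketch (the whiten helper and the synthetic data are assumptions for demonstration, not part of the original answer) that uses SciPy's sqrtm and inv for the matrix square root and inversion:

        import numpy as np
        from scipy.linalg import inv, sqrtm

        def whiten(X):
            # Center the data, then apply sigma^(-1/2) to each sample,
            # where sigma = cov(X); the sample covariance of the result
            # is then (up to numerical error) the identity matrix
            X_centered = X - X.mean(axis=0)
            sigma = np.cov(X_centered, rowvar=False)
            sigma_inv_sqrt = inv(np.real(sqrtm(sigma)))  # sigma^(-1/2)
            return X_centered @ sigma_inv_sqrt

        # Sanity check on synthetic data (one feature on a huge scale);
        # this needs more samples than features so sigma is invertible
        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 5)) * [1.0, 1.0, 1.0, 50.0, 3.0]
        print(np.round(np.cov(whiten(X), rowvar=False), 2))  # ~identity

    If you prefer to avoid the manual linear algebra, scikit-learn's PCA(whiten=True) achieves a closely related decorrelation.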