I am using libsvm for document classification; I use only svm.h and svm.cc in my project.
Its struct svm_problem requires an array of svm_node entries holding only the non-zero features, i.e. a sparse representation.
I get a vector of tf-idf values, say in the range [5, 10]. If I normalize it to [0, 1], all the 5's become 0.
Should I remove these zeros when sending the data to svm_train?
Wouldn't removing them reduce the information and lead to poor results?
Should I start the normalization at 0.001 rather than 0?
More generally, does normalizing to [0, 1] reduce information for an SVM?
An SVM is not a Naive Bayes classifier: feature values are not counters but dimensions in a multidimensional real-valued space, so 0's carry exactly the same amount of information as 1's (which also answers your concern about removing 0 values - don't do it). There is no reason to ever normalize data to [0.001, 1] for an SVM.
The only real issue here is that column-wise normalization is not a good idea for tf-idf, as it degenerates your features to plain tf: for a particular i'th dimension, tf-idf is simply the tf value in [0, 1] multiplied by a constant idf, so column-wise normalization just multiplies it back by idf^-1. I would instead consider one of the alternative preprocessing methods, such as whitening:
x → C^(-1/2) x, where C is the data covariance matrix.