I am using libsvm for document classification; I use only svm.h and svm.cc in my project.
Its struct svm_problem requires an array of svm_node entries holding only the non-zero features, i.e. a sparse representation.
I get a vector of tf-idf values, say in the range [5, 10]. If I normalize it to [0, 1], all the 5's become 0.
Should I remove these zeros when sending the data to svm_train?
Wouldn't removing them reduce the information and lead to poor results?
Should I start the normalization at 0.001 rather than 0?
More generally, does normalizing to [0, 1] reduce information for an SVM?
An SVM is not a Naive Bayes classifier: feature values are not counters but dimensions in a multidimensional real-valued space, so 0's carry exactly the same amount of information as 1's (which also answers your concern about removing 0 values - don't do it). There is no reason to ever normalize data to [0.001, 1] for an SVM.
The only real issue here is that column-wise normalization is not a good idea for tf-idf, as it degenerates your features to plain tf: for a particular i'th dimension, tf-idf is simply the tf value in [0, 1] multiplied by a constant idf, so column-wise normalization just multiplies it back by idf^-1. I would instead consider one of the alternative preprocessing methods, such as whitening:
x → C^(-1/2) x, where C is the data covariance matrix.