I am extracting features for a document. One of the features is the frequency of a word in the document. The problem is that the number of sentences in the training set and the test set is not necessarily the same, so I need to normalize it in some way. One possibility (that came to my mind) was to divide the frequency of the word by the number of sentences in the document, but my supervisor told me that it's better to normalize it in a logarithmic way. I have no idea what that means. Can anyone help me?
Thanks in advance,
PS: I also saw this topic, but it didn't help me.
The first question to ask is: what algorithm are you using subsequently? For many algorithms it is sufficient to normalize the bag-of-words vector so that it sums to one, or so that some other norm is one.
Instead of normalizing by the number of sentences, however, you should normalize by the total number of words in the document; your test corpus might have longer sentences, for example.
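A minimal sketch of that kind of length normalization in Python, assuming the document is already tokenized (the example tokens are just placeholders):

```python
from collections import Counter

def word_frequencies(tokens):
    """Relative frequency of each word: count divided by total word count.

    Dividing by the total number of words (rather than the number of
    sentences) makes documents of different lengths comparable.
    """
    counts = Counter(tokens)
    total = sum(counts.values())  # total number of words in the document
    return {word: count / total for word, count in counts.items()}

# toy example
doc = "the cat sat on the mat".split()
print(word_frequencies(doc))  # e.g. {'the': 0.333..., 'cat': 0.166..., ...}
```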
I assume your supervisor's recommendation means that you do not report the raw counts of the words but the logarithm of the counts. In addition, I would suggest looking into the TF-IDF measure in general; in my opinion it is more common in text mining.
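To illustrate both ideas, here is a rough sketch assuming scikit-learn is available; the `log_tf` helper and the example documents are hypothetical, and `sublinear_tf=True` is the TfidfVectorizer option that applies the common 1 + log(tf) scaling:

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Log-scaled count: a common convention is 1 + log(count) for count > 0, else 0.
def log_tf(count):
    return 1.0 + math.log(count) if count > 0 else 0.0

# TF-IDF with logarithmic (sublinear) term frequency; the documents are placeholders.
docs = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = TfidfVectorizer(sublinear_tf=True)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```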