
Interpretation of output of Vowpal Wabbit


I am using Vowpal Wabbit for binary sentiment classification (positive and negative) with basic unigram features. This is what my training data looks like:

1 | 28060 10778 21397 64464 19255
-1 | 44248 21397 3353 57948 4340 7594 3843 44368 33938 49249 45696 32698 57948 21949 58810 32698 62793 64464
1 | 44248 21397 3353 32698 62996
1 | 44248 21397 3353 57948 63747 40024 46815 37197 7594 47339 28060 10778 32698 45035 3843 54789 19806 60087 7594 47339

Each line starts with the label, followed by the indices of the words (in the vocabulary) that occur in the document. Each feature takes the default value of 1.

I use this command to train:

cat trainfeatures.txt | vw --loss_function logistic -f trainedModel

This is the command I use for testing:

cat testfeatures.txt | vw -i trainedModel -p test.pred

This is what the output file test.pred looks like:

28.641335
15.409834
13.057793
28.488165
16.716839
19.752426

The values range between -0.114076 and 28.641335. If I apply a rule that a prediction is positive when the value exceeds a threshold (say 14) and negative otherwise, I get an accuracy of 51% and an F-measure of 40.7%.

But the paper I am following reports an accuracy of 81% on this dataset, so I am definitely doing something wrong, either in my implementation or in my interpretation of the results. I am unable to figure out what it is.

EDIT: I used the --binary option in the test command, which gave me labels in {-1,+1}. Evaluating those, I got an accuracy of 51.25% and an F-measure of 34.88%.


Solution

  • EDIT: The main problem was that the training data was not shuffled into random order. Shuffling is needed with any online learning algorithm (unless the training data is already shuffled or is a true time series). It can be done with the Unix command shuf, as shown below.
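    For example, a minimal way to do this with the file names from the question (shuf is a standard coreutils command):

    shuf trainfeatures.txt | vw --loss_function logistic -f trainedModel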

    Explanation: In an extreme case, if the training data lists all the negative examples first, followed by all the positive examples, the model will quite probably learn to classify (almost) everything as positive.

    Another common reason for a low F1-measure (with almost all predictions positive) is imbalanced data (many positive examples, few negative examples). This was not the case for the dataset in Satarupa Guha's question, but I keep my original answer here:

    The obvious solution is to give the negative examples a higher importance weight than the default of 1. The optimal value of the importance weight can be found using a held-out set; see the example below.
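    In VW's input format, the importance weight goes between the label and the pipe. A sketch of what weighted training lines might look like, reusing features from the question (the weight 2.0 is only an illustration, not a tuned value):

    -1 2.0 | 44248 21397 3353 57948 4340
    1 | 28060 10778 21397 64464 19255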

    If I apply a rule that a prediction is positive when the value exceeds a threshold (say 14) and negative otherwise

    The threshold for deciding between negative and positive predictions should be 0: with logistic loss, the raw predictions are in logit space, where 0 corresponds to a probability of 0.5.
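    A minimal sketch of thresholding the raw scores at 0 with standard awk (the file name is taken from the question):

    awk '{ print ($1 > 0 ? "+1" : "-1") }' test.pred

    Alternatively, the --binary option (already used in the question's EDIT) applies the same 0 threshold inside VW itself.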

    Note that one of the great advantages of Vowpal Wabbit is that you do not need to convert feature names (words, in your case) to integers. You can use the raw (tokenized) text; just make sure to escape the pipe "|" and colon ":" (as well as space and newline) inside feature names. Of course, if you have already converted the words to integers, that works too.
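    For illustration, the same kind of training data could be written with raw words as features; the sentences below are invented and only the format matters:

    1 | great acting and a gripping plot
    -1 | the dialogue was flat and the pacing dragged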