How do feature bits work in Vowpal Wabbit?

I am relatively new to Vowpal Wabbit and would like to understand the -b parameter (the number of feature bits in the feature table).

My training data are something like this. There are a total of about 1 million words.

1 | a = "word" b ="word131232" c="word1233" d = "word123124" e = "word23145"

However, each row has only 5 features. How many bits should I use? When I ran it, the number of features seemed to increase with the number of examples. I do not understand why this is so.

Solution

If you use -b 18 (the default), features are hashed into a table with 2^18 entries. If the number of unique features in your dataset is close to 2^18 or higher, you should increase -b to reduce hash collisions. There is no easy way to measure the number of collisions, so the common practice is to tune -b for the best progressive validation loss (or holdout loss if you use multiple passes). The right value also depends on the memory available on your machine, because each additional bit doubles the size of the table.
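To get a rough feel for how -b affects collisions, here is a back-of-the-envelope sketch in Python. It assumes an idealised uniform hash, which is an assumption and not a description of VW's actual hash function, and the function name expected_collisions is made up for illustration. The 1,000,000 figure is simply the word count mentioned in the question.

```python
def expected_collisions(n_features: int, bits: int) -> float:
    """Expected number of features that share a slot with an earlier feature.

    Assumes each feature is hashed uniformly at random into 2**bits slots
    (an idealisation, not VW's real hash function).
    """
    slots = 2 ** bits
    # Expected number of slots that hold at least one feature.
    occupied = slots * (1.0 - (1.0 - 1.0 / slots) ** n_features)
    return n_features - occupied


if __name__ == "__main__":
    n = 1_000_000  # about 1 million unique features, as in the question
    for bits in (18, 20, 22, 24):
        print(f"-b {bits}: ~{expected_collisions(n, bits):,.0f} expected collisions")
```

This only shows the trend: under the uniform-hash assumption, the estimate drops sharply as -b grows. The value to use in practice is still best chosen by comparing validation loss, as described above.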

1 | a = "word" b ="word131232" c="word1233" d = "word123124" e = "word23145"

Note that this example is wrong (it does not do what you intended) because of the spaces around =. The equals sign has no special meaning (unlike the colon, which separates a feature name from its value). Feature names cannot contain spaces, and there is no need to enclose them in quotes. So the example should look like this:

1 | word word131232 word1233 word123124 word23145

If the prefixes a, b, c, d, e have some special meaning (i.e. a=word42 should be a different feature from b=word42), you can use:

1 | a=word b=word131232 c=word1233 d=word123124 e=word23145
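If your data is already stored in the format from the question, a small script can convert it to the prefixed form above. The following is a minimal sketch in Python. The helper name to_vw_line is hypothetical, and it assumes every value is wrapped in double quotes and contains no spaces, colons, or pipe characters.

```python
import re

# Matches optional spaces, '=', optional spaces, then a double-quoted value.
_QUOTED_VALUE = re.compile(r'\s*=\s*"([^"]*)"')


def to_vw_line(line: str) -> str:
    """Turn `a = "word"` style pairs into `a=word` (no spaces, no quotes).

    Assumption: quoted values contain no spaces, ':' or '|', since those
    characters have special meaning in the VW input format.
    """
    return _QUOTED_VALUE.sub(r"=\1", line)


if __name__ == "__main__":
    raw = '1 | a = "word" b ="word131232" c="word1233" d = "word123124" e = "word23145"'
    print(to_vw_line(raw))
```

You would apply this to each line of the training file before passing it to vw.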

If all your words are already mapped to integers (within the range 0-2^b), you can use them directly as feature names and no hashing will be done (unless you specify --hash=all):

1 | 0 131232 1233 123124 23145

See the wiki page about input format.

On the observation that the number of features seems to increase:

In the progress report (printed by default at every 2^x-th example), the last column shows current features, which is the number of features in the current example (including the constant feature and any quadratic/cubic/... features if you use them). It should not be increasing unless your data are unusual.

In the final report, vw prints total feature number, which is the average number of features per example times the number of examples times the number of passes (so it is not the number of unique features in the dataset).
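The arithmetic behind total feature number can be sketched in a few lines of Python. The function name and the example counts (1,000 examples, 3 passes) are made up for illustration. The per-example count of 6 assumes five word features plus the constant feature and no quadratic or cubic features.

```python
def total_feature_number(features_per_example: float, n_examples: int, n_passes: int = 1) -> float:
    """Average features per example times examples times passes."""
    return features_per_example * n_examples * n_passes


# Five word features plus the constant feature, as in the question's data.
per_example = 5 + 1

# Hypothetical run: 1,000 examples, 3 passes over the data.
total = total_feature_number(per_example, n_examples=1_000, n_passes=3)
print(total)
```

The total grows with the number of examples and passes even though each example still has the same handful of features, which is why it says nothing about how many unique features the dataset contains.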