Tags: machine-learning, hash, vowpalwabbit, hash-collision, regularized

Vowpal Wabbit hash collision works better than L1 as regularization


I have a VW classification model which I wanted to inspect for the number of features and the number of hash collisions. I trained it and tested it on different datasets. The datasets contain more than 400k features, so with the 18-bit VW hash space (2^18 = 262,144 slots) it is possible to store only about 260k of them without collisions.

Then, to inspect it, I created two readable models: one with the --readable_model parameter, to get the number of all hashes, and a second one with the --invert_hash parameter, to get the number of all features, including those involved in hash collisions. There were 425k features and 208k hashes (not 260k, because I used some filtering with the --keep argument, and as I understand it, vw also stores ignored namespaces in the hash table). Then I measured the quality of the model with ROC AUC, MCC, and Average Precision. The results were 0.903, 0.564, 0.591.
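
For reference, a minimal Python sketch of how such collision counts can be summarized, assuming you have already extracted (feature name, hash index) pairs from the --invert_hash output (its exact line format depends on the vw version, so the parsing step is not shown):

```python
from collections import defaultdict

def collision_stats(feature_to_index):
    """Summarize hash collisions from {feature_name: hash_index} pairs,
    e.g. extracted from vw's --invert_hash output."""
    buckets = defaultdict(list)
    for feat, idx in feature_to_index.items():
        buckets[idx].append(feat)
    return {
        "features": len(feature_to_index),
        "distinct hashes": len(buckets),
        "features sharing a bucket": sum(len(f) for f in buckets.values() if len(f) > 1),
    }

# Toy usage with made-up entries:
print(collision_stats({"a^x1": 7, "a^x2": 7, "b^x3": 12}))
```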

As you can see, the amount of hash collision is huge, so I decided to increase the bit space by setting the -b argument to 24, and trained the model again. Now there were 425k features and 425k hashes, i.e. no collisions. However, the results on the same metrics were worse (0.902, 0.554, 0.587).
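
For illustration, a rough Python sketch of the pigeonhole effect when hashing ~425k names into 2^18 vs 2^24 buckets (md5 and the synthetic names below are just stand-ins; this is not vw's actual hash function):

```python
import hashlib

def occupied_buckets(feature_names, bits):
    """Hash each name into 2**bits buckets and count how many distinct
    buckets end up occupied."""
    mask = (1 << bits) - 1
    return len({int(hashlib.md5(name.encode()).hexdigest(), 16) & mask
                for name in feature_names})

names = [f"f{i}" for i in range(425_000)]   # synthetic stand-ins for real feature names
for b in (18, 24):
    used = occupied_buckets(names, b)
    print(f"b={b}: {2 ** b} buckets, {used} occupied, "
          f"{len(names) - used} features land in an already-used bucket")
```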

So far everything looks as if there was heavy overfitting in the 24-bit space, while the 18-bit space protected the model against overfitting better; that would be a good explanation of why the results on the test dataset were better with the 18-bit model.

But then I decided to reduce the number of features in the 24-bit model with L1 regularization. I played with it for a while, and when I got a model with 208k hashes and 208k features, I was very surprised that its results were still worse than the results of the 18-bit model with the same number of hashes: 0.901, 0.584, 0.552.

This makes me believe that random hash collisions, even for a huge portion of the features, act as a better regularizer than L1 regularization. How is this possible?


Solution

  • What is going on here (bottom line)?

    What you experience is a strong sign of over-fitting. Details follow.

    How is this even possible?

    Both regularization and random hash-collisions are discounting mechanisms for a subset of model features. They pick some features, and make them less important (or completely discounted) in the model.

    But that's where the similarity ends. The two discounting mechanisms are very different in:

    • The subset of features they discount (specific vs random)
    • The discounting method (complete vs partial-blend)
    • The discounting direction and magnitude

    Specific vs random discounting

    L1 regularization (--l1 ...) picks very specific weights (the ones with the smallest absolute values, i.e. nearest zero), whereas random hash collisions "pick" random weights, some of which may be large.
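
    A toy Python sketch of the contrast (the weights and the random bucket assignment below are made up; vw's actual hashing and its truncated-gradient L1 are more involved):

```python
import random

weights = {"good_A": 0.90, "good_B": -0.75, "noise_C": 0.04,
           "noise_D": -0.03, "noise_E": 0.61}      # toy weights, not from vw

def l1_selection(w, t=0.05):
    """L1-style discounting (soft threshold): zero exactly the
    smallest-magnitude weights and shrink the rest toward zero."""
    return {k: 0.0 if abs(v) <= t else v - t * (1 if v > 0 else -1)
            for k, v in w.items()}

def collision_selection(features, bits=1, seed=3):
    """Hash-collision 'discounting': features land in 2**bits buckets at
    random, and each bucket learns one shared weight, so large and small
    weights alike can get blended together."""
    rng = random.Random(seed)
    buckets = {}
    for name in features:
        buckets.setdefault(rng.randrange(2 ** bits), []).append(name)
    return buckets

print(l1_selection(weights))                # only near-zero weights are touched
print(collision_selection(weights.keys()))  # arbitrary features forced to share buckets
```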

    Complete vs partial-blend discounting

    L1 regularization completely trims/removes the weights it selects, whereas random-hash collisions create blends with other features. In one sense (effect-mixing) blends are similar, though not identical, to what vw feature-crossing with -q <XY> does.

    Direction & magnitude of discounting

    Unlike specific selection by absolute value of weight, random selection can affect weights of any magnitude.

    Moreover, a feature might be a bad feature when considered alone, but actually contribute when considered in combination with another feature. One example would be a blending of two bad features, one with a positive and one with a negative weight. By partially canceling each other they might result in a plausibly good feature that is somewhat correlated with the target label. IOW: feature blending can sometimes turn a bad effect into a good effect.
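
    A tiny numeric illustration of that partial cancellation (toy numbers, not from vw): each feature alone would accumulate a sizable spurious weight, but when the two collide into one bucket their opposing updates largely cancel, leaving a small blended weight:

```python
# Toy gradient updates that two noisy features would receive during training.
updates_a = [+0.12, +0.08, +0.11, +0.09]     # feature a: pushes its weight up
updates_b = [-0.10, -0.09, -0.12, -0.08]     # feature b: pushes its weight down

w_a_alone = sum(updates_a)                   # ~ +0.40 in its own bucket
w_b_alone = sum(updates_b)                   # ~ -0.39 in its own bucket
w_shared  = sum(updates_a) + sum(updates_b)  # ~ +0.01 when they share one bucket

print(w_a_alone, w_b_alone, w_shared)
```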

    How can discounting help (or hurt) a model?

    This is very common in Machine Learning, and especially in big-data problems. Whether feature selection, trimming, or blending improves accuracy is data-dependent.

    If it happens to discount a "bad" feature (one that agrees with the training-data, but doesn't help in generalization) it makes the model better. However, if we discount a good, well-generalizing feature, it would make the model appear worse in testing (out of sample data).

    Related: the idea of randomly discounting or dropping features, even potentially important ones, has proven to be a powerful and beneficial technique in deep neural networks (NNs), where it is called dropout. Dropout has become the de rigueur method for avoiding over-fitting in deep learning.
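
    A minimal sketch of the dropout idea itself (inverted dropout as commonly used when training NNs; this is unrelated to any vw option):

```python
import random

def dropout(activations, p_drop=0.5, rng=random):
    """Inverted dropout: zero each value with probability p_drop and scale
    the survivors by 1/(1 - p_drop) so the expected total is unchanged."""
    keep = 1.0 - p_drop
    return [0.0 if rng.random() < p_drop else a / keep for a in activations]

print(dropout([0.3, 1.2, -0.7, 0.05, 2.1]))
```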

    Bottom line

    Creating good models takes practice. When you have a very large number of features, over-fitting due to random effects (involving both small and large weights) is possible. This over-fitting needs to be avoided.

    There are many ways to avoid over-fitting. Regularization is just one specific way to reduce it. Sometimes a blunter, random approach, which affects all features rather than just those with low weights, may be beneficial overall.

    When this happens, it is a hint that the number of features may be too large, and you may be over-fitting them.

    Generally, I would be suspicious of any model where the number of examples (data-set rows) used for training is not much larger than the number of features (distinct data-set columns). If you have hundreds of thousands (~10^5 to 10^6) of features, you should probably have on the order of 10^12 (trillions) of examples to avoid over-fitting.

    Another thing I would do with a large number of features is to randomly shuffle the order of the examples and blend several models, to ensure that a specific order doesn't cause over-fitting: online learning tends to overweight early examples over later ones due to learning-rate decay.
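
    A minimal sketch of that shuffle-and-blend idea, assuming the training data is available as a list of example lines and that each model is trained separately (with vw or any other online learner) outside this snippet:

```python
import random

def shuffled_copies(example_lines, n_models=5, base_seed=7):
    """Make several order-randomized copies of the training data, one per
    model, so no single example ordering dominates."""
    copies = []
    for k in range(n_models):
        rng = random.Random(base_seed + k)
        shuffled = list(example_lines)   # copy; don't mutate the caller's data
        rng.shuffle(shuffled)
        copies.append(shuffled)
    return copies

def blend(per_model_predictions):
    """Uniform average of each model's prediction for every test example."""
    return [sum(p) / len(p) for p in zip(*per_model_predictions)]

# Toy usage: three models' predictions for four test examples.
print(blend([[0.9, 0.2, 0.7, 0.4],
             [0.8, 0.3, 0.6, 0.5],
             [1.0, 0.1, 0.8, 0.3]]))
```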