Tags: machine-learning, classification, bayesian, training-data, naivebayes

Does my Naive Bayes training data need to be proportional?


I'll use spam classification as an example. The canonical approach would be to hand-classify a random sampling of emails and use them to train the NB classifier.

Great, now say I added a bunch of archived emails that I know are not spam. Would this skew my classifier's results, because the ratio of spam to non-spam is no longer representative? I can think of two ways this could happen:

  • The features become too non-spam heavy.
  • The algorithm implicitly uses probability(spam) in its classification (in the same way that probability(medical condition) is discounted by the rarity of the condition, even when the test is positive).

In general, more training data is better than less, so I'd like to add it if it doesn't break the algorithm.


Solution

  • You can train on all of your data; the algorithm itself does not require proportional classes. That said, as you observed, distorting the proportions distorts the estimated probabilities and leads to bad outcomes. If your email flow is 20% spam and you train a spam filter on 99% spam and 1% good email (ham), you're going to end up with a hyper-aggressive filter.
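To make the distortion concrete, here is a minimal sketch. It assumes the classifier estimates P(spam) from the class frequencies in the training set, which many Naive Bayes implementations do by default. The function name `posterior_spam` and the likelihood ratio of 0.5 are made up for illustration; they are not from any particular library.

```python
def posterior_spam(prior_spam, likelihood_ratio):
    """Return P(spam | message) from Bayes' rule in odds form.

    prior_spam       -- P(spam), here taken from training-set proportions
    likelihood_ratio -- P(message | spam) / P(message | ham)
    """
    prior_odds = prior_spam / (1.0 - prior_spam)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical message whose words are twice as likely under ham as under spam.
LIKELIHOOD_RATIO = 0.5

# Prior learned from a representative sample (20% spam).
representative = posterior_spam(0.20, LIKELIHOOD_RATIO)

# Prior learned from a training set that is 99% spam.
skewed = posterior_spam(0.99, LIKELIHOOD_RATIO)

print(f"representative prior: P(spam | message) = {representative:.3f}")
print(f"skewed prior:         P(spam | message) = {skewed:.3f}")
```

With the representative prior, this ham-leaning message scores about 0.11 and passes; with the skewed prior, the same message scores about 0.98 and gets flagged. The word evidence is identical in both cases, only the prior changed.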

    The common approach to this is two-step:

    1. Seed the filter by running a representative sample of data through it (say, 1,000 emails in the spam filter scenario).
    2. As the filter encounters additional data, only update the weights if the filter gets it wrong. This is called "train-on-error."

    If you follow this approach, your filter will not get confused by a sudden burst of spam that just happens to include, say, the word "trumpet" alongside words that really are spammy. It will adjust only when necessary, but will catch up as quickly as it needs to when it is wrong. This is one way of defending against the "Bayesian poisoning" approach that most spammers now take. They can clutter up their messages with a lot of garbage, but they only have so many ways to describe their products or services, and those words will always be spammy.
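
The two-step approach above (seed, then train-on-error) could be sketched roughly as follows. This is a toy, not a production filter: `TinyNaiveBayes` and `train_on_error` are hypothetical names, the model is a plain word-count Naive Bayes with add-one smoothing, and the seed set is four made-up messages instead of the 1,000 or so suggested above.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Minimal word-count Naive Bayes with add-one smoothing (illustrative only)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, tokens, label):
        """Add one labeled message to the counts."""
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def classify(self, tokens):
        """Return the label with the higher log posterior score."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.LABELS:
            # Smoothed log prior, so an unseen class never yields log(0).
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            # The extra +1 reserves probability mass for unseen words.
            denom = total_words + len(vocab) + 1
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


def train_on_error(model, tokens, true_label):
    """Update the model only if its current prediction is wrong.

    Returns True when an update happened, False otherwise.
    """
    if model.classify(tokens) == true_label:
        return False
    model.train(tokens, true_label)
    return True


# Step 1: seed the filter with a (tiny, made-up) representative sample.
model = TinyNaiveBayes()
seed = [
    ("cheap pills buy now".split(), "spam"),
    ("win money now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
for tokens, label in seed:
    model.train(tokens, label)

# Step 2: afterwards, learn only from mistakes.
# Already classified correctly as spam, so the counts stay untouched.
updated_when_right = train_on_error(model, "buy cheap pills".split(), "spam")
# Looks like ham to the seeded model, so this spam message triggers an update.
updated_when_wrong = train_on_error(model, "monday lunch deal".split(), "spam")

print(updated_when_right, updated_when_wrong)
print(dict(model.doc_counts))
```

Because correctly classified messages never touch the counts, a flood of easy spam (or easy ham) does not keep shifting the class proportions away from the seeded ones; only the messages the filter gets wrong are added.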