I'll use spam classification as an example. The canonical approach would be to hand-classify a random sampling of emails and use them to train the NB classifier.
Great, now say I added a bunch of archived emails that I know are not spam. Would this skew my classifier's results, because the proportion of spam to non-spam is no longer representative? I can think of two ways this might happen:
In general, more training data is better than less, so I'd like to add it if it doesn't break the algorithm.
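For concreteness, the setup I have in mind is roughly this (a sketch using scikit-learn's `MultinomialNB`; the tiny hand-labeled corpus and the message I classify at the end are just placeholders):

```python
# Canonical approach: train Naive Bayes on a hand-classified random sample of emails.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "cheap meds buy now",            # spam
    "limited offer click here",      # spam
    "lunch meeting tomorrow",        # ham
    "quarterly report attached",     # ham
    "family photos from the trip",   # ham
]
labels = [1, 1, 0, 0, 0]  # 1 = spam, 0 = ham, assigned by hand

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

clf = MultinomialNB()   # class priors are estimated from the label counts by default
clf.fit(X, labels)

new_message = vectorizer.transform(["buy cheap meds"])
print(clf.predict(new_message))  # -> [1]
```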
You could simply train on all of the data without worrying about proportionality, but as you observed, distorting the proportions distorts the probabilities and produces bad results. If your real mail flow is 20% spam and you train the filter on 99% spam and 1% good email (ham), you're going to end up with a hyper-aggressive filter.
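To see why, here's a small numeric sketch of Bayes' rule for a single borderline message. The likelihood values are made up purely for illustration; the only thing that changes between the two cases is the class prior, which the classifier estimates from the training proportions:

```python
# Made-up per-class likelihoods for a borderline message m.
# These stay fixed; only the prior (from training proportions) changes.
p_m_given_spam = 1e-6
p_m_given_ham = 3e-6   # the words in m are actually a bit more ham-like

def posterior_spam(prior_spam):
    """P(spam | m) via Bayes' rule with a two-class normalizer."""
    prior_ham = 1.0 - prior_spam
    numerator = prior_spam * p_m_given_spam
    denominator = numerator + prior_ham * p_m_given_ham
    return numerator / denominator

print(posterior_spam(0.20))  # ~0.08 -> classified as ham  (prior from a 20% spam flow)
print(posterior_spam(0.99))  # ~0.97 -> classified as spam (prior from 99% spam training data)
```

The likelihoods never changed; only the skewed training proportions pushed the same message over the spam threshold.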
The common approach to this is two-step:
If you follow this approach, your filter will not get confused by a sudden burst of spam that just happens to include, say, the word "trumpet" alongside words that really are spammy. It will adjust only when necessary, but will catch up as quickly as it needs to when it is wrong. This is one way of preventing the "Bayesian poisoning" approach that most spammers now take. They can clutter up their messages with a lot of garbage, but they only have so many ways to describe their products or services, and those words will always be spammy.
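The "adjust only when it is wrong" part is essentially a train-on-error loop. A minimal sketch, assuming a hypothetical `spam_filter` object with `classify` and `learn` methods and a stream of (message, user verdict) pairs:

```python
def run_filter(spam_filter, message_stream):
    """Train-on-error loop: the filter learns only from its mistakes, so a
    disproportionate burst of mail does not skew its probabilities."""
    for message, true_label in message_stream:   # true_label comes from user feedback
        predicted = spam_filter.classify(message)
        if predicted != true_label:
            # Update counts/priors only when the filter was wrong,
            # so it adjusts only when necessary.
            spam_filter.learn(message, true_label)
```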