I'm performing binary text classification with two different classifiers on the same imbalanced data, and I want to compare the results of the two classifiers.
When using sklearn's logistic regression, I have the option of setting class_weight = 'balanced'.
For sklearn's naive Bayes, there is no such parameter available.
I know that I could randomly undersample the larger class to end up with equal sizes for both classes, but then data is lost.
Why is there no such parameter for naive Bayes? I guess it has something to do with the nature of the algorithm, but I can't find anything about this specific point. I would also like to know what the equivalent would be: how can I achieve a similar effect, so that the classifier is aware of the imbalanced data and gives more weight to the minority class and less to the majority class?
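Roughly what I mean, assuming MultinomialNB is the naive Bayes variant used for the text features:

```
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Logistic regression can reweight the classes directly:
logreg = LogisticRegression(class_weight='balanced')

# MultinomialNB has no class_weight parameter:
nb = MultinomialNB()
```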
I'm writing this partially in response to the other answer here.
Logistic regression and naive Bayes are both linear models that produce linear decision boundaries.
Logistic regression is the discriminative counterpart to naive Bayes (a generative model). You decode each model to find the best label according to p(label | data). What sets naive Bayes apart is that it does this via Bayes' rule: p(label | data) ∝ p(data | label) * p(label).
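A toy illustration of that decoding step, with made-up numbers rather than any fitted model:

```
import numpy as np

# p(label | data) ∝ p(data | label) * p(label), with made-up numbers
p_label = np.array([0.9, 0.1])              # class priors: common class A, rare class B
p_data_given_label = np.array([0.02, 0.3])  # likelihood of one document under each class

unnormalized = p_data_given_label * p_label
posterior = unnormalized / unnormalized.sum()
print(posterior)           # [0.375 0.625] -- the rare class can still win
print(posterior.argmax())  # 1
```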
(The other answer is right that naive Bayes treats the features as independent of each other given the class; that is the naive Bayes assumption. With collinear features, this can sometimes lead to bad probability estimates for naive Bayes, though the classification is still quite good.)
This factoring is why naive Bayes handles class imbalance so well: it keeps separate books for each class, with a parameter for each (feature, label) pair. That means the super-common class can't mess up the estimates for the super-rare class, and vice versa.
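You can see this bookkeeping directly in sklearn. A small sketch with made-up count data, assuming MultinomialNB is the variant in use: it stores one log-probability per (label, feature) pair, plus one prior per label.

```
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word-count data: 6 documents, 3 features, imbalanced labels (5 vs 1)
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [4, 0, 2],
              [1, 0, 1],
              [0, 3, 2],
              [2, 1, 1]])
y = np.array([0, 0, 0, 0, 1, 0])

nb = MultinomialNB().fit(X, y)
print(nb.feature_log_prob_.shape)   # (2, 3): one parameter per (label, feature) pair
print(np.exp(nb.class_log_prior_))  # ~[0.83, 0.17]: the empirical label frequencies
```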
There is one place where the imbalance can seep in: the class prior p(label). By default it matches the empirical distribution in your training set: if 90% of the training examples are label A, then p(A) will be 0.9. If you think the training distribution of labels isn't representative of the test distribution, you can manually alter the p(label) values to match your prior belief about how frequent label A, label B, etc., will be in the wild.
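In sklearn terms (a sketch, assuming MultinomialNB; BernoulliNB takes the same arguments), the prior is the class_prior parameter, and fit_prior=False forces a uniform prior:

```
from sklearn.naive_bayes import MultinomialNB

# Override the learned prior with your own belief about label frequencies
# (the 50/50 value here is just an illustrative choice):
nb_custom_prior = MultinomialNB(class_prior=[0.5, 0.5])

# Or ignore the training label frequencies entirely and use a uniform prior:
nb_uniform_prior = MultinomialNB(fit_prior=False)
```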