machine-learning, nlp, text-classification, naive-bayes

Is Naive Bayes biased?


I have a use case where text needs to be classified into one of three categories. I started with Naive Bayes [Apache OpenNLP, Java], but I was informed that the algorithm is biased: if my training data is 60% classA, 30% classB and 10% classC, then the algorithm tends to be biased towards classA and thus predicts texts of the other classes as classA.

If this is true, is there a way to overcome this issue?

There are other algorithms that I came across, such as an SVM classifier or logistic regression (maximum entropy model), but I am not sure which will be more suitable for my use case. Please advise.


Solution

  • Is there a way to overcome this issue?

    Yes, there is. But first you need to understand why it happens.

    Basically, your dataset is imbalanced.

    An imbalanced dataset is one where the number of observations is not the same for all classes, i.e. some classes have many more instances than others.

    In this scenario, your model becomes biased towards the majority class simply because you have more training data for that class.
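
    A quick way to see how skewed your data actually is, is to count the labels in your training file. Below is a minimal sketch, assuming the standard OpenNLP doccat line format (category label first, then the text) and a placeholder file name train.txt; the per-class ratio it prints is exactly the prior P(class) that Naive Bayes learns:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Counts how many training samples each category has in an OpenNLP doccat
// training file (one sample per line, label first). "train.txt" is a placeholder.
public class ClassDistribution {

    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new TreeMap<>();
        long total = 0;
        for (String line : Files.readAllLines(Paths.get("train.txt"), StandardCharsets.UTF_8)) {
            if (line.isEmpty()) continue;
            String label = line.split("\\s+", 2)[0];   // first token = category label
            counts.merge(label, 1L, Long::sum);
            total++;
        }
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            // This ratio is the class prior Naive Bayes estimates, so a 60/30/10
            // split directly tilts every prediction towards the 60% class.
            System.out.printf("%s: %d (%.1f%%)%n",
                    e.getKey(), e.getValue(), 100.0 * e.getValue() / total);
        }
    }
}
```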

    Solutions

    1. Under-sampling: randomly remove samples from the majority class until the dataset is balanced (see the resampling sketch after this list).
    2. Over-sampling: add more samples of the minority classes (e.g. by duplicating them) until the dataset is balanced.
    3. Change the performance metric: use F1-score, `recall` or `precision` to measure the performance of your model (see the evaluation sketch further below).
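
    Here is a minimal resampling sketch for points 1 and 2, again assuming the doccat line format and the placeholder file names train.txt / train_balanced.txt. It under-samples every class down to the size of the smallest one; for over-sampling you would instead duplicate minority-class lines up to the size of the largest class:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Naive random under-sampling: keep only as many samples per class as the
// smallest class has, then write a balanced training file.
public class UnderSample {

    public static void main(String[] args) throws IOException {
        Map<String, List<String>> byClass = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("train.txt"), StandardCharsets.UTF_8)) {
            if (line.isEmpty()) continue;
            String label = line.split("\\s+", 2)[0];
            byClass.computeIfAbsent(label, k -> new ArrayList<>()).add(line);
        }

        // The size of the smallest class decides how many samples every class keeps.
        int minSize = byClass.values().stream().mapToInt(List::size).min().orElse(0);

        List<String> balanced = new ArrayList<>();
        Random rnd = new Random(42);
        for (List<String> samples : byClass.values()) {
            Collections.shuffle(samples, rnd);
            balanced.addAll(samples.subList(0, minSize));
        }
        Collections.shuffle(balanced, rnd);

        Files.write(Paths.get("train_balanced.txt"), balanced, StandardCharsets.UTF_8);
    }
}
```

    Keep in mind that under-sampling throws data away and over-sampling by plain duplication can make the model overfit the repeated samples, so compare both against training on the original data.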

    There are a few more solutions; if you want to know more, refer to this blog.
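
    On point 3: with a 60/30/10 split, a model that labels everything as classA already scores 60% accuracy, so accuracy alone hides the problem. Below is a small self-contained sketch (toy labels, purely for illustration) that computes precision, recall and F1 per class from gold and predicted labels:

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Per-class precision, recall and F1, so performance on the small classes
// is visible instead of being hidden behind overall accuracy.
public class PerClassF1 {

    public static void main(String[] args) {
        // Toy gold and predicted labels, purely for illustration.
        List<String> gold = Arrays.asList("A", "A", "B", "C", "B", "A", "C", "A");
        List<String> pred = Arrays.asList("A", "A", "A", "C", "B", "A", "A", "B");

        for (String c : new TreeSet<>(gold)) {
            int tp = 0, fp = 0, fn = 0;
            for (int i = 0; i < gold.size(); i++) {
                boolean isGold = gold.get(i).equals(c);
                boolean isPred = pred.get(i).equals(c);
                if (isGold && isPred) tp++;
                else if (isPred) fp++;   // predicted c but gold is another class
                else if (isGold) fn++;   // gold is c but predicted another class
            }
            double precision = (tp + fp) == 0 ? 0 : (double) tp / (tp + fp);
            double recall = (tp + fn) == 0 ? 0 : (double) tp / (tp + fn);
            double f1 = (precision + recall) == 0 ? 0 : 2 * precision * recall / (precision + recall);
            System.out.printf("%s  P=%.2f  R=%.2f  F1=%.2f%n", c, precision, recall, f1);
        }
    }
}
```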

    There are other algorithms that I came across, such as an SVM classifier or logistic regression (maximum entropy model), but I am not sure which will be more suitable for my use case.

    You will never know unless you try; I would suggest you try 3-4 different algorithms on your data.
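
    If you stay with OpenNLP, switching algorithms is mostly a matter of changing the trainer in TrainingParameters, so comparing Naive Bayes with a maximum entropy (logistic regression) model is cheap; for an SVM you would need a different library. A rough sketch, assuming opennlp-tools 1.8+ on the classpath and the placeholder training file from above; a real comparison should of course evaluate each model on a held-out test set rather than on one hand-typed sentence:

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

// Trains a document categorizer twice, once with Naive Bayes and once with
// maximum entropy, on the same (balanced) training file and prints the
// predicted category for a sample text. File name and test tokens are placeholders.
public class CompareAlgorithms {

    public static void main(String[] args) throws Exception {
        for (String algorithm : new String[] {"NAIVEBAYES", "MAXENT"}) {
            InputStreamFactory in = new MarkableFileInputStreamFactory(new File("train_balanced.txt"));
            ObjectStream<DocumentSample> samples =
                    new DocumentSampleStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));

            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ALGORITHM_PARAM, algorithm);
            params.put(TrainingParameters.CUTOFF_PARAM, "0");   // keep all features on small datasets

            DoccatModel model = DocumentCategorizerME.train("en", samples, params, new DoccatFactory());
            DocumentCategorizerME categorizer = new DocumentCategorizerME(model);

            double[] probs = categorizer.categorize(new String[] {"some", "test", "tokens"});
            System.out.println(algorithm + " -> " + categorizer.getBestCategory(probs));
        }
    }
}
```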