I created a word sentiment app using the Naive Bayes algorithm.
The training data has two classes: positive and negative. I collect the unique words in each class, so I end up with the full set of unique words per class. Then I calculate the probability of occurrence of each unique word within its class.
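Here is a minimal sketch of that training step, assuming the documents are already tokenized into word lists. The function name `train_naive_bayes` and the toy data are my own, and I've added Laplace (add-one) smoothing, which is standard for word-level Naive Bayes so unseen words don't get probability zero:

```python
from collections import Counter

def train_naive_bayes(docs_by_class):
    """Count word occurrences per class and return smoothed likelihoods.

    docs_by_class maps a class label to a list of tokenized documents.
    """
    # Vocabulary = unique words across all classes
    vocab = {w for docs in docs_by_class.values() for doc in docs for w in doc}
    likelihoods = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        # Laplace (add-one) smoothing so unseen words never get P = 0
        likelihoods[label] = {
            w: (counts[w] + 1) / (total + len(vocab)) for w in vocab
        }
    return likelihoods

data = {
    "positive": [["good", "movie"], ["great", "acting"]],
    "negative": [["bad", "movie"], ["boring", "plot"]],
}
model = train_naive_bayes(data)
```

With smoothing, the likelihoods for each class sum to 1 over the shared vocabulary, and a word like "good" comes out more probable under the positive class than the negative one.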
The problem appears when I use imbalanced training data. For example, if I use 60% negative and 40% positive training data, the test results skew toward negative, and vice versa.
Other than having to use balanced data, what can I do to solve this problem? Is there an additional method I should add?
My understanding is that Naive Bayes is sensitive to imbalanced training data because the posterior for each class is weighted by the prior, and this prior is taken from the class proportions in the training data. Maybe you already see what I mean.
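To make the prior effect concrete, here is a small sketch (the likelihood numbers are made up for illustration, not from a real corpus). One common mitigation is to replace the empirical priors with uniform priors at classification time, which removes the pull toward the majority class:

```python
import math

def classify(doc, likelihoods, priors):
    """Score each class as log P(c) + sum of log P(w|c) over known words."""
    best_label, best_score = None, float("-inf")
    for label, probs in likelihoods.items():
        score = math.log(priors[label])
        score += sum(math.log(probs[w]) for w in doc if w in probs)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy likelihoods: "decent" is mildly positive evidence, "movie" is neutral
likelihoods = {
    "positive": {"decent": 0.12, "movie": 0.20},
    "negative": {"decent": 0.09, "movie": 0.20},
}

doc = ["decent", "movie"]

# Empirical priors from a 60/40 imbalanced corpus overpower the word evidence:
# 0.6 * 0.09 * 0.20 = 0.0108 beats 0.4 * 0.12 * 0.20 = 0.0096
label_empirical = classify(doc, likelihoods, {"negative": 0.6, "positive": 0.4})

# Uniform priors let the word evidence decide
label_uniform = classify(doc, likelihoods, {"negative": 0.5, "positive": 0.5})
```

Here `label_empirical` comes out "negative" while `label_uniform` comes out "positive", showing how the prior alone can flip a borderline document. Other options worth looking at are undersampling/oversampling the training set, or Complement Naive Bayes, which was designed to reduce exactly this majority-class bias.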