I am using the Natural NPM package to do some fairly simple text analysis. However, I am finding a huge gap between the results when I process the same sample texts (articles of 600-2000 words) with the LogisticRegressionClassifier and the BayesClassifier.
BayesClassifier Results:
mlb
// classifier.getClassifications(data)
[ { label: 'mlb', value: 5.056332563372173e-139 },
{ label: 'nba', value: 5.589251687911356e-164 },
{ label: 'nhl', value: 1.2887446397232257e-165 },
{ label: 'nfl', value: 1.4562872037319007e-167 } ]
mlb // result of classifier.classify(data)
LogisticRegressionClassifier Results:
mlb
// classifier.getClassifications(data)
[ { label: 'mlb', value: 0.9984418828983803 },
{ label: 'nhl', value: 0.008472129523116049 },
{ label: 'nfl', value: 0.0005530225293869185 },
{ label: 'nba', value: 9.776621359081668e-18 } ]
mlb // result of classifier.classify(data)
Obviously the LogisticRegressionClassifier is giving me much better results, but it takes significantly longer to process each article, in some cases several minutes. I am using 50 hand-selected articles for each category.
My question is: what is the underlying difference between these two processing methods, and is there a way I can better prepare my samples for the BayesClassifier (which appears to be much quicker)? For example, would it be beneficial to stem the articles before processing them? Any other tips or tricks?
Also, I know there is a bunch of trial and error involved, but based on experience, is there a good number of articles to use to train the algorithms? I have tried a range of 10-400 for each category and seem to get relatively similar results regardless of sample size.
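You might be misunderstanding the output of the getClassifications function. For the Bayes classifier, those numbers represent the probability of the text given the label. That probability is a product of many small per-word probabilities, which is why a long article yields such tiny values. For the logistic regression, the numbers represent the probability of each class given the text. The two sets of values are therefore on different scales and cannot be compared directly. In both cases, you should predict the class that has the highest probability. That's the way these classifiers work.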
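If you want to look at the two outputs on a similar footing, one option is to rescale the Bayes scores so that they sum to 1. The following is only a minimal sketch in plain JavaScript using the values from the question. `normalize` is a hypothetical helper, not part of Natural, and reading the rescaled values as class probabilities assumes the four classes are roughly equally likely beforehand.

```javascript
// Scores copied from the BayesClassifier output in the question.
const bayesScores = [
  { label: 'mlb', value: 5.056332563372173e-139 },
  { label: 'nba', value: 5.589251687911356e-164 },
  { label: 'nhl', value: 1.2887446397232257e-165 },
  { label: 'nfl', value: 1.4562872037319007e-167 }
];

// Hypothetical helper (not a Natural API): divide each score by the total
// so the values sum to 1 and can be read as relative shares.
function normalize(scores) {
  const total = scores.reduce((sum, s) => sum + s.value, 0);
  return scores.map(s => ({ label: s.label, value: s.value / total }));
}

const normalized = normalize(bayesScores);
console.log(normalized);
```

After rescaling, mlb takes essentially the whole share, so on this article the Bayes classifier is at least as decisive as the logistic regression.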
From what you've shown here, both classifiers pick mlb, so it is not obvious which one would work better on your data.