I'm building an NLTK classifier that assigns a category to a sentence.
I have already trained the classifier on 11,000 sentences' featuresets:
from nltk.classify import naivebayes

train_set, test_set = featuresets[1000:], featuresets[:1000]
classifier = naivebayes.NaiveBayesClassifier.train(train_set)
But now I want to add more (sentence, category) featuresets to improve the classifier. The only way I know is to append the new featuresets to the list of already-learned featuresets and train a brand-new classifier from it. I think this is inefficient, because it takes a long time to retrain everything just to add one (or a few) more sentences.
Is there a good way to improve the classifier's quality by adding featuresets?
Two things.
Naive Bayes is usually very fast. It visits all your training data only once and accumulates feature-class co-occurrence statistics; from those it builds the model. So it is usually not a problem to simply re-train your model on the old plus the new (incremental) data.
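For reference, a minimal sketch of that re-training approach with NLTK; featuresets and new_featuresets are placeholders for your own lists of (features, label) pairs:

from nltk.classify import NaiveBayesClassifier

combined = featuresets + new_featuresets            # old + new (features, label) pairs
classifier = NaiveBayesClassifier.train(combined)   # a single pass over all the data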
It's doable to avoid redoing those steps when new data comes in, as long as you still have the feature-class stats stored somewhere. You just visit the new data the same way as you did in step 1 and keep updating the feature-class co-occurrence stats. At the end of the day you have new numerators (m) and denominators (n), which apply to both the class priors P(C) and the probability of a feature given a class P(W|C). You derive the probabilities as m/n.
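As an illustration only (this is not NLTK's API; the class and method names below are made up, and smoothing is deliberately left out), the bookkeeping could look something like this:

from collections import defaultdict

class IncrementalCounts:
    def __init__(self):
        self.class_counts = defaultdict(int)                          # per-class doc counts
        self.feature_counts = defaultdict(lambda: defaultdict(int))   # m for P(W|C)
        self.total = 0                                                # n for P(C)

    def update(self, featuresets):
        # Visit only the new (features, label) pairs and bump the counts.
        for features, label in featuresets:
            self.class_counts[label] += 1
            self.total += 1
            for name, value in features.items():
                self.feature_counts[label][(name, value)] += 1

    def prior(self, label):                      # P(C) = m/n
        return self.class_counts[label] / self.total

    def likelihood(self, name, value, label):    # P(W|C) = m/n
        return self.feature_counts[label][(name, value)] / self.class_counts[label]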
Friendly reminder of Bayesian formulas in document classification:
-- Given a document D, the probability that the document falls in category C_j is:
P(C_j|D) = P(D|C_j) * P(C_j) / P(D)
-- That probability is proportional to:
P(C_j|D) ~ P(W1|C_j) P(W2|C_j) ... P(Wk|C_j) * P(C_j)
based on the naive assumption that the words W1, W2, ..., Wk in the doc are independent, and dropping P(D) because every class has the same P(D) as denominator (thus we say proportional to, not equal to).
-- Now all probabilities on the right side can be computed from a corresponding fraction (m/n), where m and n are stored in (or can be derived from) the feature-class co-occurrence matrix.
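Tying that back to the counts, here is a sketch using the hypothetical prior/likelihood helpers from the snippet above; a real implementation would add smoothing and sum log probabilities instead of multiplying raw ones, to avoid underflow:

def score(nb, features, label):
    p = nb.prior(label)                          # P(C_j)
    for name, value in features.items():
        p *= nb.likelihood(name, value, label)   # P(W_i|C_j)
    return p                                     # proportional to P(C_j|D)

# Predict by picking the class with the highest score:
# best = max(nb.class_counts, key=lambda c: score(nb, features, c))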