python machine-learning text-classification countvectorizer

How to make a text classifier output a None category

I'm doing text classification for dialects. I trained a classifier on 3 dialects and evaluated it on my test data. Now suppose I extract a tweet from Twitter and ask the classifier for its dialect: what if the tweet isn't written in any of those 3 dialects? I assume the classifier will assign one of the 3 categories regardless, which would be a false positive. I want it to output a None category instead. How can I do that? Should I also provide training data labelled None?

Solution

If you want to predict a new category (in this case None) with the same classifier, you have to provide training data corresponding to this category.
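A minimal sketch of this first option, assuming scikit-learn. The training strings are made-up placeholders, and the choice of character n-grams and `LogisticRegression` is an assumption for illustration, not something the question specifies:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up placeholder strings; replace with real labelled tweets.
train_texts = [
    "aaa aab aba", "aab aaa abb",  # dialect A
    "xxy xyx yxx", "xyy xxy yxy",  # dialect B
    "mmn mnm nmm", "mnn mmn nmn",  # dialect C
    "qqr qrq rqq", "zzt ztz tzz",  # text in none of the three dialects
]
train_labels = ["A", "A", "B", "B", "C", "C", "None", "None"]

# Character n-grams are an assumed feature choice for this sketch.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# "None" is now an ordinary class the model can predict.
prediction = model.predict(["qrq zzt"])[0]
print(prediction)
```

The None examples should cover as much variety as possible (other languages, other dialects, mixed text), because the classifier can only learn to reject text that resembles what it saw under that label.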

Another idea (discussed in more detail here: https://stats.stackexchange.com/questions/174856/semi-supervised-classification-with-unseen-classes) is to train a multi-class classifier that assigns a sentence to one of the dialects, and then train one one-class classifier per dialect to confirm or reject the multi-class prediction.

An example:
Dialects A, B, C.

Multi-class classifier assigns sentence to dialect A.
One-class classifier for dialect A classifies sentence as dialect A.
Sentence belongs to dialect A.

Multi-class classifier assigns sentence to dialect A.
One-class classifier for dialect A classifies sentence as not dialect A.
Sentence belongs to unknown dialect (None).
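
The confirm-or-reject scheme above could be sketched as follows, assuming scikit-learn. `OneClassSVM` is one possible one-class model and is my choice here, as are the placeholder training strings, the `nu` value and the helper name `predict_with_rejection`; none of them come from the linked discussion:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

# Made-up placeholder strings; real training needs far more text per dialect.
train_texts = [
    "aaa aab aba", "aab aaa abb", "aba abb aab",  # dialect A
    "xxy xyx yxx", "xyy xxy yxy", "yxx yxy xyx",  # dialect B
    "mmn mnm nmm", "mnn mmn nmn", "nmm nmn mnm",  # dialect C
]
train_labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3

vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))
X_train = vectorizer.fit_transform(train_texts)

# Step 1: multi-class classifier over the known dialects.
multi_clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Step 2: one one-class model per dialect, fitted only on that dialect's rows.
one_class = {}
for dialect in sorted(set(train_labels)):
    rows = [i for i, label in enumerate(train_labels) if label == dialect]
    one_class[dialect] = OneClassSVM(nu=0.1).fit(X_train[rows])  # nu needs tuning


def predict_with_rejection(texts):
    """Return the predicted dialect, or None if its one-class model rejects it."""
    X = vectorizer.transform(texts)
    results = []
    for i, guess in enumerate(multi_clf.predict(X)):
        # OneClassSVM.predict returns +1 for inliers and -1 for outliers.
        verdict = one_class[guess].predict(X[i:i + 1])[0]
        results.append(str(guess) if verdict == 1 else None)
    return results


print(predict_with_rejection(["aab aba", "qqq zzz"]))
```

How often a sentence is rejected depends heavily on the one-class model's parameters, so they should be tuned on held-out data that includes text from outside the three dialects.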