python machine-learning text-classification countvectorizer

How to make a text classifier output a None category


I'm doing text classification for dialects. I trained a classifier on 3 dialects and evaluated it on my test data. Now suppose I pull a tweet from Twitter and ask the classifier for the corresponding dialect. What if the tweet isn't written in any of those 3 dialects? I assume the classifier will assign one of the 3 categories regardless, which would be a false positive. Instead, I want it to output a None category. How can I do that? Should I also provide training data labeled None?


Solution

  • If you want to predict a new category (in this case None) with the same classifier, you have to provide training data corresponding to this category.

    Another idea (discussed in more detail here: https://stats.stackexchange.com/questions/174856/semi-supervised-classification-with-unseen-classes) is to train a multi-class classifier that assigns a sentence to one of the dialects, then train a separate one-class classifier for each dialect that can confirm or reject the multi-class classifier's prediction.

    An example:
    Dialects A, B, C.

    Multi-class classifier assigns sentence to dialect A.
    One-class classifier for dialect A classifies sentence as dialect A.
    Sentence belongs to dialect A.

    Multi-class classifier assigns sentence to dialect A.
    One-class classifier for dialect A classifies sentence as not dialect A.
    Sentence belongs to unknown dialect (None).
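
    The two-stage idea above could be sketched with scikit-learn roughly as follows. This is only a minimal illustration, not a tuned solution: the toy sentences, the labels A/B/C, the helper name predict_dialect, and the choice of LogisticRegression and a linear OneClassSVM with nu=0.1 are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

# Hypothetical toy corpus: three made-up "dialects" A, B and C.
train_texts = [
    "howdy yall fixin to go", "yall come back now", "fixin to eat supper yall",
    "cheers mate fancy a cuppa", "bloody brilliant mate", "fancy a pint mate cheers",
    "gday mate no worries", "no worries arvo barbie", "barbie this arvo gday",
]
train_labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

# Shared bag-of-words features for every model.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)

# Stage 1: multi-class classifier that always picks one of the known dialects.
multi_clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

# Stage 2: one one-class model per dialect, fitted only on that dialect's rows.
one_class = {}
for dialect in sorted(set(train_labels)):
    rows = [i for i, label in enumerate(train_labels) if label == dialect]
    one_class[dialect] = OneClassSVM(kernel="linear", nu=0.1).fit(X[rows])


def predict_dialect(text):
    """Return a dialect label, or None if the one-class model rejects it."""
    x = vectorizer.transform([text])
    candidate = multi_clf.predict(x)[0]
    # OneClassSVM.predict returns +1 for inliers and -1 for outliers.
    accepted = one_class[candidate].predict(x)[0] == 1
    return candidate if accepted else None


print(predict_dialect("yall fixin to go"))
print(predict_dialect("12345 67890"))
```

    How often the second stage rejects a sentence depends heavily on the features, the kernel, and nu, so these would need tuning on held-out data that includes sentences from outside the three dialects.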