I'm planning to use scikit-learn for text classification, with TfidfVectorizer and MultinomialNB.
But I realized that MultinomialNB will always assign my sample to an existing (known) category.
For example, if I have:
category A: trained with sample "this is green"
category B: trained with sample "this is blue"
category C: trained with sample "this is red"
and I try to predict "this is yellow",
it will give me category A
(or any other, because the probability is the same for all categories in this case).
My question is: is there a classifier that would give me "unknown" (or none, or false, or an error) for the test case above?
I would like to know when my test case cannot be predicted from the given training set.
I think I could check whether my_classifier.predict_proba(X_test) returns an array with all equal or nearly equal values (in this example: [[ 0.33333333 0.33333333 0.33333333]]).
Actually, I would have to check whether the values are close to their defaults, because the default probabilities might not be the same for each category :)
So... is there a better approach, or a classifier with some confidence threshold I could use?
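The thresholding idea described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name `predict_or_unknown` and the `margin` value are hypothetical, and it treats the "defaults" as the class priors that MultinomialNB stores in `class_log_prior_`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set from the question.
train_texts = ["this is green", "this is blue", "this is red"]
train_labels = ["A", "B", "C"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)


def predict_or_unknown(texts, margin=0.05):
    """Return the predicted label, or "unknown" when the posterior
    barely moves away from the class priors.

    `margin` is an arbitrary, hypothetical threshold; it would need
    tuning on real data.
    """
    proba = clf.predict_proba(vectorizer.transform(texts))
    priors = np.exp(clf.class_log_prior_)  # the "default" probabilities
    # How far the most favoured class rose above its prior, per sample.
    gain = (proba - priors).max(axis=1)
    best = clf.classes_[proba.argmax(axis=1)]
    return [str(label) if g > margin else "unknown"
            for label, g in zip(best, gain)]


print(predict_or_unknown(["this is yellow", "this is green"]))
```

With this toy data, "yellow" is not in the vocabulary, so the posterior equals the priors and the sample is reported as "unknown", while "this is green" moves clearly toward category A. On a real corpus the margin would have to be chosen empirically.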
You can look into novelty detection. I would check out that link and the associated example. The idea in that example is to use a one-class SVM:
One-class SVM is an unsupervised algorithm that learns a decision function for novelty detection: classifying new data as similar or different to the training set.
(Emphasis is mine.) I don't know how it would perform with the small amount of data in your example; I'd guess "poorly". Still, I believe novelty detection is the sort of thing you are looking for here.
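A minimal sketch of how this might be wired up with scikit-learn's `OneClassSVM` on TF-IDF features is shown below. The helper name `is_novel` and the `nu` value are assumptions made for illustration, and with only three training sentences the decisions are not expected to be meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Toy training set from the question; labels are not needed here,
# because the one-class SVM is fitted on the training samples alone.
train_texts = ["this is green", "this is blue", "this is red"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# nu roughly bounds the fraction of training points treated as outliers.
# The value 0.1 is an arbitrary choice for this sketch.
detector = OneClassSVM(nu=0.1).fit(X_train)


def is_novel(texts):
    """Return True for each text the detector flags as unlike the training set.

    OneClassSVM.predict returns +1 for inliers and -1 for outliers.
    """
    preds = detector.predict(vectorizer.transform(texts))
    return [bool(p == -1) for p in preds]


print(is_novel(["this is yellow", "this is green"]))
```

One possible design is to run the detector first and only pass samples it accepts as familiar on to the MultinomialNB classifier, reporting everything else as "unknown".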