machine-learning · statistics · text-classification · naive-bayes

Machine learning text classification where a text belongs to 1 to N classes


So I am trying (just for fun) to classify movies based on their descriptions. The idea is to "tag" movies, so a given movie might be both "action" and "humor" at the same time, for example.

Normally a text classifier gives you the single class a given text belongs to, but in my case I want to assign a text to 1 to N tags.

Currently my training set looks like this:

+--------------------------+---------+
|        TEXT              |  TAG    |
+--------------------------+---------+
| Some text from a movie   |  action |
+--------------------------+---------+
| Some text from a movie   |  humor  |
+--------------------------+---------+
| Another text here        | romance |
+--------------------------+---------+
| Another text here        | cartoons|
+--------------------------+---------+
| And some text more       | humor   |
+--------------------------+---------+

What I do next is train one classifier per tag that tells me whether or not that tag applies to a given text. So, for example, to decide whether a text should be classified as "humor", I would end up with the following training set:

+--------------------------+---------+
|        TEXT              |  TAG    |
+--------------------------+---------+
| Some text from a movie   |  humor  |
+--------------------------+---------+
| Another text here        |not humor|
+--------------------------+---------+
| And some text more       | humor   |
+--------------------------+---------+
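The per-tag datasets above can be derived mechanically from the multilabel data. Here is a minimal sketch in plain Python (the helper name `binarize_per_tag` is made up for this example):

```python
def binarize_per_tag(rows):
    """rows: list of (text, set_of_tags).
    Returns {tag: [(text, is_positive), ...]} — one binary
    training set per tag, i.e. "X" vs "not X"."""
    all_tags = set()
    for _, tags in rows:
        all_tags |= tags
    datasets = {}
    for tag in sorted(all_tags):
        # Every text appears in every per-tag dataset; the label says
        # whether this particular tag applies to it.
        datasets[tag] = [(text, tag in tags) for text, tags in rows]
    return datasets

rows = [
    ("Some text from a movie", {"action", "humor"}),
    ("Another text here", {"romance", "cartoons"}),
    ("And some text more", {"humor"}),
]
per_tag = binarize_per_tag(rows)
```

For instance, `per_tag["humor"]` reproduces the humor / not-humor table above, with `True` standing for "humor" and `False` for "not humor".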

Then I train a classifier that learns whether or not a text is humor (and I take the same approach with the rest of the tags). After that I end up with a total of 4 classifiers:

  • action / no action
  • humor / no humor
  • romance / no romance
  • cartoons / no cartoons
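Each of these binary classifiers can be, as the question says, a Naive Bayes model. As a rough illustration (not the asker's actual code), here is a tiny bag-of-words Naive Bayes with add-one (Laplace) smoothing; the function names and the toy training texts are invented for this sketch:

```python
import math
from collections import Counter

def train_binary_nb(examples):
    """examples: list of (text, is_positive).
    Stores per-class word counts and document counts."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, label in examples:
        docs[label] += 1
        counts[label].update(text.lower().split())
    vocab = set(counts[True]) | set(counts[False])
    return {"counts": counts, "docs": docs, "vocab": vocab}

def prob_positive(model, text):
    """P(positive | text) via Bayes' rule with add-one smoothing,
    computed in log space and then normalized."""
    total_docs = model["docs"][True] + model["docs"][False]
    v = len(model["vocab"])
    scores = {}
    for label in (True, False):
        logp = math.log(model["docs"][label] / total_docs)  # class prior
        n = sum(model["counts"][label].values())
        for word in text.lower().split():
            logp += math.log((model["counts"][label][word] + 1) / (n + v))
        scores[label] = logp
    m = max(scores.values())
    exp = {k: math.exp(s - m) for k, s in scores.items()}
    return exp[True] / (exp[True] + exp[False])

# Toy "humor / no humor" classifier trained on made-up snippets:
model = train_binary_nb([("funny funny joke", True),
                         ("sad war drama", False)])
```

Training one such model per tag on the corresponding binarized dataset gives exactly the 4 classifiers listed above.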

Finally, when I get a new text, I run it through each of the 4 classifiers. For each classifier that gives a positive classification (that is, X instead of no-X) with a probability over a certain threshold (say 0.9), I assume the new text belongs to tag X. I repeat this for each of the classifiers.
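That final step can be sketched like this — each trained classifier is represented here by a stand-in function returning a made-up probability, and `assign_tags` is a hypothetical helper name:

```python
def assign_tags(classifiers, text, threshold=0.9):
    """classifiers: {tag: callable(text) -> P(tag | text) in [0, 1]}.
    Keeps every tag whose positive probability clears the threshold."""
    return {tag for tag, prob_fn in classifiers.items()
            if prob_fn(text) >= threshold}

# Toy stand-ins for the 4 trained models (probabilities are invented):
classifiers = {
    "action":   lambda t: 0.95 if "explosion" in t else 0.10,
    "humor":    lambda t: 0.92 if "joke" in t else 0.20,
    "romance":  lambda t: 0.30,
    "cartoons": lambda t: 0.05,
}
tags = assign_tags(classifiers, "an explosion and a joke")  # {"action", "humor"}
```

Note that with a hard threshold a text can also end up with zero tags; whether that is acceptable depends on the application.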

In particular I am using Naive Bayes as the algorithm, but the same approach would work with any algorithm that outputs a probability.

Now the question is: is this approach correct? Am I doing something terribly wrong here? The results I get seem to make sense, but I would like a second opinion.


Solution

  • Yes, this makes sense. It is a well-known, basic technique for multilabel/multiclass classification, known as the "one vs all" (or "one vs rest") classifier; in the multilabel setting it is also called "binary relevance". It is very old and widely used. On the other hand, it is also quite naive, as it does not consider any relations between your classes/tags. You might be interested in reading about structured learning, which covers problems where there is some structure over the label space that can be exploited (and usually there is).