machine-learning, naivebayes, supervised-learning, document-classification

How to identify document categories like Movie or Biography


I am currently categorizing documents into a predefined set of classes. For this I am using Multinomial Naive Bayes (MNB), and it works fine for most categories, such as baseball, sports, or space.

However, how do I identify articles in categories like movies or biographies? MNB relies primarily on a bag-of-words approach, so it is easy to detect baseball articles because they contain a lot of baseball jargon. Movie and biography articles, however, contain very little category-specific jargon. A movie document describes or reviews a movie using words specific to that movie only, so an article about A Few Good Men may contain many legal terms, which may lead to it being mislabeled as "Law". The same goes for biographies, which just describe the life of a person.

How can I classify these kinds of documents?


Solution

  • A good solution is to combine Named Entity Recognition (NER) with a semi-supervised approach. For example, tag the names of actors in each sentence (semi-supervised entity extraction methods can help with this) and count the occurrences of that entity type: the more often actors (our entity) are mentioned in a sentence, the more likely the sentence is about movies. Then add this count as a feature, since it may be representative and important for the classifier. Try to find more features like this in your data sets and feed them to your classifier.

    You can check the effectiveness and impact of any added feature with a measure such as the chi-squared (Chi2) statistic or the ANOVA F-value.
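
As a rough illustration of the idea above, here is a minimal Python sketch that builds an entity-count feature and scores it with a chi-squared statistic. Everything in it is an assumption made for illustration: the `ACTOR_NAMES` gazetteer is a hypothetical stand-in for the output of a real NER system, the four toy documents are invented, and `chi2_score` is a hand-written helper rather than a library call. In practice, scikit-learn's `sklearn.feature_selection.chi2` and `f_classif` provide chi-squared and ANOVA F-value scores.

```python
import re
from collections import Counter

# Hypothetical gazetteer standing in for the actor entities a real NER
# system would tag; the names and documents below are illustrative only.
ACTOR_NAMES = {"tom cruise", "jack nicholson", "demi moore"}


def count_entities(text, gazetteer=ACTOR_NAMES):
    """Count how many times any gazetteer name occurs in the text."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(name) + r"\b", lowered))
        for name in gazetteer
    )


def chi2_score(feature_values, labels):
    """Chi-squared statistic between a non-negative count feature and labels.

    Compares the observed per-class sum of the feature with the sum that
    would be expected if the feature were independent of the class.
    """
    total = sum(feature_values)
    n_docs = len(labels)
    observed = Counter()
    for value, label in zip(feature_values, labels):
        observed[label] += value
    score = 0.0
    for label, class_size in Counter(labels).items():
        expected = total * class_size / n_docs
        if expected > 0:
            score += (observed[label] - expected) ** 2 / expected
    return score


docs = [
    ("Tom Cruise and Jack Nicholson face off in a tense courtroom drama.", "movie"),
    ("Demi Moore stars alongside Tom Cruise as a Navy lawyer.", "movie"),
    ("The court ruled that the contract was void under state law.", "law"),
    ("The defendant appealed the verdict to a higher court.", "law"),
]
labels = [label for _, label in docs]

# Entity-count feature: number of actor mentions per document.
actor_counts = [count_entities(text) for text, _ in docs]
# Plain bag-of-words style feature: occurrences of the substring "court".
legal_counts = [text.lower().count("court") for text, _ in docs]

print("actor-count chi2:", chi2_score(actor_counts, labels))
print("'court'-count chi2:", chi2_score(legal_counts, labels))
```

On this toy data the actor-count feature scores higher than the legal-term count, because "court" also shows up in a movie review while actor names appear only in the movie documents. With real data, the entity counts would be appended to the bag-of-words features before training the classifier.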