Tags: ontology, n-gram, document-classification, vowpalwabbit

Document multi-label classification - where do you get the labels? Ontology?


I am familiar with data mining techniques but not so much with text mining or Web mining.

Here is a simple task: classify articles into a set of categories. Let us assume I have extracted the text content of the article and processed it.

How and where do you get the categories - pre-defined labels? Is it possible to plug in an ontology or taxonomy for that and go as granular as needed? The classification task will be multi-label.

Do we use n-grams in this case for approximate matching?

Currently I have themes and named entities extracted from the text. Can I use Vowpal Wabbit for that?


Solution

  • How and where do you get the categories - pre-defined labels?

    There are many benchmark text datasets with taxonomy and ontology information. WordNet is one such widely used resource in text-analysis research. This is the first paper that focused on using a taxonomy over WordNet to derive a semantic similarity measure for text analysis. This is a more recent paper with a similar objective.

    Is it possible to plug in an ontology or taxonomy for that and go as granular as needed?

    Yes. There is a research subfield that deals with deriving semantic similarity from the taxonomy and ontology relations that exist among concepts (in this case, concepts in text documents). This paper provides an overview and comparative study of techniques that bring ontology and taxonomy into measuring similarity among documents. As for going as granular as needed: yes, you can do so by defining a similarity measure that controls the granularity. A good deal of research work addresses this; this paper is a recent example.
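
    To give a concrete feel for taxonomy-based similarity, here is a minimal sketch using NLTK's WordNet interface. The library, the two example concepts, and the particular similarity measures are my own illustrative choices, not something prescribed by the papers above. Path similarity scores concepts by the shortest path between them in the hypernym hierarchy, while Wu-Palmer similarity also weighs the depth of their shared ancestor, which is one simple knob for granularity:

    ```python
    # Requires: pip install nltk, then a one-time nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    # Two concepts that could come from the extracted themes / named entities
    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')

    # Shortest-path similarity in the WordNet hypernym taxonomy (0..1, higher = closer)
    print(dog.path_similarity(cat))

    # Wu-Palmer similarity: factors in the depth of the least common subsumer,
    # so more specific (deeper) shared ancestors yield higher scores
    print(dog.wup_similarity(cat))
    ```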

    Do we use n-grams in this case for approximate matching?

    Yes, it is possible, but the papers mentioned above use less granular approaches that model concepts from documents; most of them use tf-idf rather than n-grams of terms.
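
    To make the tf-idf versus n-gram point concrete, here is a minimal multi-label sketch with scikit-learn. The toy corpus, the label sets, and the library choice are illustrative assumptions on my part; in practice the labels would come from your taxonomy or ontology. Switching ngram_range is all it takes to move from unigram tf-idf to tf-idf over word n-grams:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import LogisticRegression

    # Hypothetical processed article texts and their taxonomy labels
    docs = [
        "central bank raises interest rates",
        "team wins championship after penalty shootout",
        "bank announces job cuts amid rate pressure",
    ]
    labels = [["economy", "finance"], ["sports"], ["economy", "business"]]

    # Unigram tf-idf features; set ngram_range=(1, 2) to also include word bigrams
    vectorizer = TfidfVectorizer(ngram_range=(1, 1))
    X = vectorizer.fit_transform(docs)

    # Multi-label setup: binarize the label sets and train one classifier per label
    Y = MultiLabelBinarizer().fit_transform(labels)
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X, Y)

    # Predict the label indicator vector for a new article
    print(clf.predict(vectorizer.transform(["rate hike hits bank profits"])))
    ```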