machine-learning nlp classification tagging text-classification

Document Tagging with Named Topics, relevant literature? (Also asked on Quora)


I am working in an area of data science that is very new to me, and I would like to know whether anyone can suggest existing academic literature with approaches relevant to my problem.

The problem setting is as follows: I have a set of named topics (about 100 topics). We have a document tagging engine that tags documents (news articles in our case) based on their text with up to 5 of these 100 topics.

All this is done using fairly rudimentary similarity metrics: each topic is a text vector, as is each document, and we compute the similarity between these vectors and assign the 5 most similar topics to each document.
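To make the current setup concrete, here is a minimal sketch of similarity-based tagging. It assumes bag-of-words count vectors and cosine similarity; the actual vectors and metric in the engine may differ, and the `tokenize` and `tag_by_similarity` helpers and the three toy topic descriptions are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, chosen for brevity)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def tag_by_similarity(document, topic_texts, k=5):
    """Return the names of the k topics whose vectors are most similar to the document."""
    doc_vec = Counter(tokenize(document))
    scores = {
        name: cosine(doc_vec, Counter(tokenize(text)))
        for name, text in topic_texts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical topic descriptions standing in for the real 100 topic vectors.
topic_texts = {
    "sports": "football match goal team league player",
    "finance": "stock market shares bank interest rates",
    "politics": "election parliament vote minister government",
}

tags = tag_by_similarity(
    "The team scored a late goal to win the match", topic_texts, k=2
)
```

With 100 real topics, `k=5` would reproduce the "up to 5 topics per document" behaviour described above.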

We are looking to improve the quality of this process, but we must keep the set of 100 named topics, which are vital for other purposes. That rules out unsupervised topic models like LDA because: 1. They don't provide named topics. 2. Even if we could somehow map the topic distributions output by LDA to our existing topics, those distributions would not remain constant; they vary with the underlying corpus.

So could anyone point me towards papers that have worked with document tagging using a finite set of named topics?

There are 2 challenges here: 1. Given a finite set of named topics, how do we tag new documents with them? (This is the bigger, more obvious challenge.) 2. How do we keep the topics updated as the document universe changes? Any work that addresses one or both of these challenges would be a great help.

P.S. I've also asked this question on Quora if anyone else is looking for answers and would like to read both posts. I'm duplicating this question as I feel it is interesting and I'd like to get as many people talking about this problem as possible and as many literature suggestions as possible.



Solution

  • Have you tried classification?

    Train a classifier for each topic.

    Tag with the 5 most likely classes.
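As a rough illustration of the suggestion above, here is a sketch that trains one binary classifier per topic and tags a new document with the highest-scoring topics. It assumes you already have documents labelled with topics (for instance, output of the existing engine that has been reviewed). The small multinomial naive Bayes model, the helper names (`BinaryNaiveBayes`, `train_topic_classifiers`, `tag_document`), and the toy data are all assumptions made for the example; any classifier that produces a per-topic score would fit the same pattern.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, chosen for brevity)."""
    return text.lower().split()


class BinaryNaiveBayes:
    """Multinomial naive Bayes for a single topic: 'has topic' vs 'does not'."""

    def fit(self, docs, labels):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for doc, label in zip(docs, labels):
            self.counts[label].update(tokenize(doc))
            self.n_docs[label] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def log_odds(self, doc):
        """Log-odds that the document belongs to the topic (higher = more likely)."""
        # Smoothed class prior, then add-one smoothed per-token likelihood ratios.
        score = math.log((self.n_docs[True] + 1) / (self.n_docs[False] + 1))
        totals = {
            c: sum(self.counts[c].values()) + len(self.vocab)
            for c in (True, False)
        }
        for tok in tokenize(doc):
            if tok not in self.vocab:
                continue  # ignore tokens never seen in training
            p_pos = (self.counts[True][tok] + 1) / totals[True]
            p_neg = (self.counts[False][tok] + 1) / totals[False]
            score += math.log(p_pos / p_neg)
        return score


def train_topic_classifiers(docs, doc_tags, topics):
    """Train one binary classifier per topic (one-vs-rest)."""
    return {
        topic: BinaryNaiveBayes().fit(docs, [topic in tags for tags in doc_tags])
        for topic in topics
    }


def tag_document(doc, classifiers, k=5):
    """Return the k topics whose classifiers give the document the highest score."""
    scores = {topic: clf.log_odds(doc) for topic, clf in classifiers.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical labelled training data; each document carries a set of topic names.
docs = [
    "the team won the football match",
    "the central bank raised interest rates",
    "parliament will vote on the budget",
    "football club shares rise on the stock market",
]
doc_tags = [{"sports"}, {"finance"}, {"politics"}, {"sports", "finance"}]
topics = ["sports", "finance", "politics"]

classifiers = train_topic_classifiers(docs, doc_tags, topics)
predicted = tag_document("the football team lost the match", classifiers, k=2)
```

Because the topic names are fixed labels rather than learned clusters, the 100 named topics stay intact; retraining the classifiers periodically on newly tagged documents is one possible way to track a changing corpus.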