Tags: tagging, mallet, training-data, topic-modeling

MALLET for automatic topic tagging - with training data


I have a corpus of documents that I have already tagged. I have a fixed list of about 400 tags relating to different topics. Each document has been given one or more tags and a short title. (I also have a much larger list of titles, which I often reuse if a document contains very similar content.)

I want to make an interface that will suggest tags/titles (from my existing lists) for new documents that I add to the corpus, based on how I have tagged the existing documents.

I have read about the probabilistic topic model (LDA) classes, which look great for analyzing text when you don't have any existing tagged data, but I don't see any way to incorporate my existing work.

Any suggestions would be appreciated.

Kind Regards

Swami


Solution

  • For tag suggestion, in our experience a search engine is enough; there is no need for topic modeling.

    Try the steps below:

    • Set up an index on the title and abstract of all your documents.
    • Use the title or abstract of the new document as a query against the index to get a list of similar documents.
    • Take the first few most similar documents from that list and aggregate all of their tags into a tag bundle.
    • Sort the tag bundle by the frequency of each tag; the most frequent tags are the final suggestions.

    This solution is workable.
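
The steps above can be sketched roughly as follows. This is only an illustration, not the setup the answer was written for: a small in-memory TF-IDF index with cosine similarity stands in for a real search engine, and the names `TagSuggester`, `top_docs`, and `top_tags` are hypothetical. It assumes each existing document is supplied as a `(text, tags)` pair, where `text` is the title plus abstract.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TagSuggester:
    """Hypothetical helper: a toy TF-IDF index standing in for a search engine."""

    def __init__(self, documents):
        # documents: list of (text, tags) pairs for the already-tagged corpus.
        self.tags = [list(tags) for _, tags in documents]
        token_lists = [tokenize(text) for text, _ in documents]
        n = len(token_lists)
        # Document frequency: in how many documents each term occurs.
        df = Counter(term for tokens in token_lists for term in set(tokens))
        # Smoothed inverse document frequency.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.vectors = [self._vectorize(tokens) for tokens in token_lists]

    def _vectorize(self, tokens):
        # Build a unit-length TF-IDF vector; terms unknown to the index are dropped.
        counts = Counter(t for t in tokens if t in self.idf)
        vec = {t: c * self.idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def suggest(self, text, top_docs=3, top_tags=5):
        # Step 2: use the new document's text as the query.
        query = self._vectorize(tokenize(text))
        scored = []
        for i, vec in enumerate(self.vectors):
            score = sum(w * vec.get(t, 0.0) for t, w in query.items())
            if score > 0:
                scored.append((score, i))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        # Step 3: aggregate the tags of the most similar documents.
        bundle = Counter()
        for _, i in scored[:top_docs]:
            bundle.update(set(self.tags[i]))
        # Step 4: rank by frequency (ties broken alphabetically).
        ranked = sorted(bundle.items(), key=lambda kv: (-kv[1], kv[0]))
        return [tag for tag, _ in ranked[:top_tags]]


corpus = [
    ("Training a neural network for image classification", ["machine-learning", "vision"]),
    ("Neural network training tips and tricks", ["machine-learning"]),
    ("Baking sourdough bread at home", ["cooking"]),
    ("Image segmentation with convolutional networks", ["vision", "machine-learning"]),
]
suggester = TagSuggester(corpus)
print(suggester.suggest("neural network image training"))
```

With a corpus of real size, the index would more likely live in a dedicated search engine, and `top_docs` and `top_tags` would need tuning against documents whose tags are already known. The same lookup could suggest titles by aggregating the titles of the nearest documents instead of their tags.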