machine-learning, nlp, topic-modeling, mallet, relevance

Document relevancy score based on topic modelling


I currently have a trained topic model using MALLET (http://mallet.cs.umass.edu/topics.php) that is based on about 80 000 collected news articles (these articles all belong to one category).

I want to assign a relevancy score to each new article that comes in (it may or may not be related to the category). Is there any way to achieve this? I've read up on tf-idf, but it seems to score documents against the existing articles, not incoming new ones. The end goal is to filter out articles that might be irrelevant.

Any ideas or help is greatly appreciated. Thank you!


Solution

  • Once you have the trained model (topics), you can score new, unseen documents as described in the MALLET documentation (see the --evaluator-filename [FILENAME] option and the evaluate-topics command in the section quoted below):

    Topic Held-out probability

    --evaluator-filename [FILENAME] The previous section describes how to get topic proportions for new documents. We often want to estimate the log probability of new documents, marginalized over all topic configurations. Use the MALLET command bin/mallet evaluate-topics --help to get information on using held-out probability estimation. As with topic inference, you must make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.

    Note: I have mostly used gensim's LDA and LSI models, where you can pass a new document as follows:

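    # `dictionary` (gensim.corpora.Dictionary) and `lda_model` (gensim.models.LdaModel)
    # are the objects built when the model was trained (not shown here)
    new_doc = "Human computer interaction"
    new_vec = dictionary.doc2bow(new_doc.lower().split())  # bag-of-words vector
    print(lda_model[new_vec])  # list of (topic_id, probability) pairs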
    
    #output: [(0, 0.020229542), (1, 0.49642297)
    

    Interpretation: (1, 0.49642297) means that, of the two topics (categories) in this model, the new document is best represented by topic #1. In your case, you can take the maximum probability from the output list as the relevancy "coefficient": a high value suggests the article belongs to the category, and a low value suggests it does not. I used two topics here only to make the output easier to read. If you have a single topic, just pick a minimum threshold, for example 0.40: if the score falls above it, the article is in the category, otherwise it is not.
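
The thresholding idea above could be sketched roughly as follows. This is only an illustration, not part of gensim or MALLET: the helper names `relevancy_score` and `is_relevant` are hypothetical, the input is assumed to be a list of (topic_id, probability) pairs like the one printed above, and 0.40 is just the example threshold from the text, which you would need to tune on your own data.

```python
from typing import List, Tuple

TopicProbs = List[Tuple[int, float]]


def relevancy_score(topic_probs: TopicProbs) -> float:
    """Return the highest topic probability, or 0.0 if no topics were returned."""
    return max((prob for _, prob in topic_probs), default=0.0)


def is_relevant(topic_probs: TopicProbs, threshold: float = 0.40) -> bool:
    """Treat a document as relevant when its best topic reaches the threshold.

    The default of 0.40 is only the example value from the answer (an assumption).
    """
    return relevancy_score(topic_probs) >= threshold


# Example input in the same shape as the output shown above
example = [(0, 0.020229542), (1, 0.49642297)]
score = relevancy_score(example)
print(score, is_relevant(example))
```

With gensim you would pass `lda_model[new_vec]` instead of the hard-coded `example` list. Keeping the scoring separate from the model call makes it easy to try different thresholds on a small set of articles you have labelled by hand.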