Tags: python, nlp, data-science, data-analysis, tagging

Which learning model should be chosen to predict news text tags?


I have a database of news texts (100,000 samples). Half of the dataset is tagged and half is not. What methodology can I use to analyze the remaining news items and fill in their tags?

Data example:

Text = A cap on the price of Russian oil will restrict Russia's revenues for its "illegal war Ukraine", the US says. The cap, approved by Western allies on Friday, is aimed at stopping countries paying more than $60 (£48) for a barrel of seaborne Russian crude oil. The measure - due to come into force on Monday - intensifies Western pressure on Russia over the invasion… [long text is cut]

Tags = ['russian', 'oil', 'war']

I know how to use Python and pandas, but so far I have only found methods that predict whether a text is positive or negative (sentiment classification), not methods that assign tags.


Solution

  • There are quite a few NLP content-tagging methods: keyphrase-based, classification-based, and custom rule-based approaches (if you know the principles by which the tags were set up manually). Try combining them.

    1. Split your datasets into tagged and untagged parts.
    2. The tagged dataset is the one you can experiment with: split it into train/validation/test sets and check the metrics.
    3. Analyze which tags are mistakenly missing or added, and fine-tune the solution.
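    The classification-based route from the steps above can be sketched with scikit-learn: since each article can carry several tags, this is a multi-label problem, so the tag lists are binarized and one classifier is trained per tag. The sample texts and tags below are hypothetical stand-ins for the tagged half of the dataset, and TF-IDF plus logistic regression is just one reasonable baseline, not the only option.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    # Toy stand-in for the tagged half of the dataset (hypothetical samples).
    texts = [
        "A cap on the price of Russian oil will restrict Russia's revenues",
        "Oil prices rose after the embargo on Russian crude",
        "The war in Ukraine intensified Western sanctions",
        "Sanctions over the war target Russian oil exports",
    ]
    tags = [["russian", "oil"], ["oil"], ["war"], ["russian", "oil", "war"]]

    # Turn the tag lists into a binary indicator matrix, one column per tag.
    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(tags)

    # TF-IDF features + one independent binary classifier per tag.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X, y)

    # Predict tags for an untagged article and map back to tag names.
    new = vectorizer.transform(["Russian oil revenues fall as the war drags on"])
    predicted = mlb.inverse_transform(clf.predict(new))
    print(predicted)
    ```

    On the real 50,000 tagged samples you would hold out validation/test splits (step 2) and score with a multi-label metric such as micro-averaged F1 before tagging the untagged half.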

    Explore the articles:

    [LinkedIn] Automatic Content Tagging using NLP and Machine Learning

    [medium] k-MeanCeption: How to automatically tag news articles using clustering algorithms?
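    The keyphrase-based method mentioned above can be approximated without any labels at all: score each document's words by TF-IDF and take the top-scoring terms as candidate tags. This is a minimal sketch with a hypothetical toy corpus; dedicated keyphrase extractors would normally do better.

    ```python
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical toy corpus; in practice use the full news dataset.
    docs = [
        "A cap on the price of Russian oil will restrict revenues for the war",
        "Oil prices rose after sanctions on Russian crude exports",
        "Peace talks stalled as the war in Ukraine continued",
    ]

    # Score words by TF-IDF; the top-scoring terms of each document
    # serve as candidate tags.
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    terms = np.array(vec.get_feature_names_out())

    top_k = 3
    for row in X.toarray():
        keywords = terms[np.argsort(row)[::-1][:top_k]]
        print(list(keywords))
    ```

    Candidate tags produced this way can then be matched against the existing tag vocabulary from the tagged half of the dataset, which keeps the output consistent with the manually assigned tags.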