
How to get general categories for text using NLP like fasttext?


I am working on an application and I would like to infer general categories from the text using natural language processing. I am new to Natural Language Processing (NLP).

The Google Natural Language API does this using a reasonable, high-level set of content categories such as "/Arts & Entertainment" and "/Hobbies & Leisure":

https://cloud.google.com/natural-language/docs/categories

I am hoping to do this with open-source tools and would like to use general categories such as the Wikipedia high-level classifications:

https://en.wikipedia.org/wiki/Category:Main_topic_classifications

fastText seems like a good option, but I'm struggling to find a corpus to use for training. I do see the Wikipedia word vector files and can get the full Wikipedia download, but I don't see an easy way to get the articles tagged with their categories in a form fastText can train on.

Is there an open-source tool that can identify high-level general categories given some text, or is there a training dataset I could use?


Solution

  • I'd suggest using the "zero-shot classification" pipeline from the Hugging Face Transformers library. It's very easy to use and has decent accuracy, given that you don't need to train anything yourself. Here is an interactive web application to see what it does without coding. Here is a Jupyter notebook which demonstrates how to use it in Python. You can just copy-paste code from the notebook.

    This would look something like this:

    # Install in a terminal first: pip install transformers==3.4.0
    from transformers import pipeline
    
    # Load the default zero-shot classification model (downloaded on first use)
    classifier = pipeline("zero-shot-classification")
    
    sequence = "I like just watching TV during the night"
    candidate_labels = ["arts", "entertainment", "politics", "economy", "cooking"]
    
    # Returns a dict with the candidate labels sorted by descending score
    result = classifier(sequence, candidate_labels)
    print(result)
    
    # output:
    # 'labels': ['entertainment', 'economy', 'politics', 'arts', 'cooking'],
    # 'scores': [0.939170241355896, 0.13490302860736847, 0.011731419712305069, 0.0025395064149051905, 0.00018942927999887615]
    

    Here are details on the theory, if you are interested.
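
To turn that result into a short list of categories for an application, one option is to keep only the labels above a score cutoff. The sketch below assumes the result is a dict with parallel 'labels' and 'scores' lists, as in the output shown above. The helper names `top_categories` and `categorize`, the 0.5 cutoff, and the limit of three labels are illustrative choices, not part of the Transformers library.

```python
from typing import Callable, Dict, List, Tuple


def top_categories(result: Dict, min_score: float = 0.5,
                   max_labels: int = 3) -> List[Tuple[str, float]]:
    """Return up to max_labels (label, score) pairs with score >= min_score."""
    # Sort explicitly rather than relying on the order of the input lists.
    pairs = sorted(zip(result["labels"], result["scores"]),
                   key=lambda pair: pair[1], reverse=True)
    return [(label, score) for label, score in pairs
            if score >= min_score][:max_labels]


def categorize(classifier: Callable, text: str, candidate_labels: List[str],
               min_score: float = 0.5,
               max_labels: int = 3) -> List[Tuple[str, float]]:
    """Run a zero-shot classifier on text and keep the confident labels.

    `classifier` is assumed to be the object returned by
    pipeline("zero-shot-classification"), or any callable that returns
    a dict of the same shape.
    """
    result = classifier(text, candidate_labels)
    return top_categories(result, min_score, max_labels)


# Result copied from the output shown above.
example = {
    "labels": ["entertainment", "economy", "politics", "arts", "cooking"],
    "scores": [0.939170241355896, 0.13490302860736847, 0.011731419712305069,
               0.0025395064149051905, 0.00018942927999887615],
}
print(top_categories(example))
```

With the pipeline from the snippet above, the call would be `categorize(classifier, sequence, candidate_labels)`. The cutoff probably needs tuning on a handful of your own texts, since longer or more abstract category names can shift the scores.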