python · nlp · text-classification · bert-language-model · huggingface-transformers

HuggingFace Transformers model for German news classification


I've been trying to find a suitable model for my project (multiclass German text classification) but got a little confused by the models offered here. There are models with the text-classification tag, but they are for binary classification. Most of the other models are for [MASK] word prediction. I am not sure which one to choose and whether it will work with multiple classes at all.

Would appreciate any advice!


Solution

  • You don't need to look for a specific text classification model when your classes are completely different, because most of the listed models take one of the base models, fine-tune its base layers, and train output layers for their own task. In your case you would remove those output layers anyway, and their fine-tuning of the base layers will neither benefit nor hurt you much. Sometimes a model has an extended vocabulary, which could be beneficial for your task, but you have to check the description (which is often sparse :() and the vocabulary yourself to learn more about the respective model (see the vocabulary sketch after the code below).

    In general I recommend working with one of the base models right away and only looking for other models if the results are insufficient.

    The following is an example for BERT with 6 classes:

    from transformers import BertForSequenceClassification, BertTokenizer

    # Load the pretrained German BERT and attach a fresh classification head with 6 output labels
    tokenizer = BertTokenizer.from_pretrained("bert-base-german-dbmdz-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-german-dbmdz-uncased", num_labels=6)
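
    As a minimal sketch of how this setup is used, here is a single forward pass; the German example sentence is made up, and the predictions are meaningless until the new classification head has been fine-tuned on your labeled data:

    import torch

    # Tokenize a single (made-up) German sentence into model inputs
    inputs = tokenizer(
        "Die Bundesregierung plant neue Investitionen in erneuerbare Energien.",
        return_tensors="pt", truncation=True
    )

    # Forward pass without gradients; logits have shape (batch_size, num_labels) == (1, 6)
    with torch.no_grad():
        logits = model(**inputs).logits

    # Index of the highest-scoring class (random until the model is fine-tuned)
    print(logits.argmax(dim=-1).item())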
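
    And since checking a model's vocabulary was mentioned above, a quick way to do that is to tokenize a few domain-specific words and see how strongly they get split into subword pieces; the example words here are arbitrary:

    # Size of the subword vocabulary of the loaded tokenizer
    print(len(tokenizer))

    # Domain terms that splinter into many subword pieces are a hint that
    # the vocabulary may not cover your kind of text well
    for word in ["Bundestagswahl", "Impfstofflieferung"]:  # arbitrary example words
        print(word, "->", tokenizer.tokenize(word))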