I am working on a text classification project and I am using spaCy for it. Right now I have an accuracy of almost 70%, but that is not enough. I've been trying to improve the model for the past two weeks, but with no successful results so far. So here I am, looking for advice about what I should do or try. Any help would be highly appreciated!
So, here is what I have done so far:
1) Preparing the data:
I have an unbalanced dataset of German news with 21 categories (like POLITICS, ECONOMY, SPORT, CELEBRITIES, etc.). In order to make the categories equal in size, I duplicate the small classes. As a result I have 21 files with almost 700 000 lines of text.
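The duplication itself is nothing clever; it is roughly the following sketch (file names and paths are placeholders for however you store the classes, and categories is just the list of the 21 category names):

import random

# read each category file (one file per class; names are placeholders)
texts_by_cat = {}
for cat in categories:
    with open(cat + '.txt', encoding='utf8') as f:
        texts_by_cat[cat] = [line.strip() for line in f if line.strip()]

target = max(len(lines) for lines in texts_by_cat.values())  # size of the biggest class

for cat, lines in texts_by_cat.items():
    # duplicate randomly chosen lines until the class is as big as the largest one
    balanced = lines + [random.choice(lines) for _ in range(target - len(lines))]
    with open(cat + '_balanced.txt', 'w', encoding='utf8') as f:
        f.write('\n'.join(balanced))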
I then normalize this data using the following code:
import re
import spacy
from charsplit import Splitter

POS = ['NOUN', 'VERB', 'PROPN', 'ADJ', 'NUM']  # allowed parts of speech
# 'stop_words' (words to drop) and 'german' (a collection of known German words used to validate splits) are defined elsewhere

nlp_helper = spacy.load('de_core_news_sm')
splitter = Splitter()

def normalizer(texts):
    arr = []  # list of normalized texts (will be returned from the function as a result of normalization)
    docs = nlp_helper.pipe(texts)  # creating doc objects for multiple lines
    for doc in docs:  # iterating through each doc object
        text = []  # list of words in normalized text
        for token in doc:  # for each word in the text
            word = token.lemma_.lower()
            if word not in stop_words and token.pos_ in POS:  # deleting stop words and some parts of speech
                if len(word) > 8 and token.pos_ == 'NOUN':  # only nouns can be split
                    _, word1, word2 = splitter.split_compound(word)[0]  # checking only the division with the highest probability
                    word1 = word1.lower()
                    word2 = word2.lower()
                    if word1 in german and word2 in german:
                        text.append(word1)
                        text.append(word2)
                    elif word1[:-1] in german and word2 in german:  # word1[:-1] - checking for the 's' that joins two words
                        text.append(word1[:-1])
                        text.append(word2)
                    else:
                        text.append(word)
                else:
                    text.append(word)
        arr.append(re.sub(r'[.,;:?!"()\-=_+*&^@/\']', ' ', ' '.join(text)))  # delete punctuation
    return arr
Some explanations of the above code:
POS - a list of allowed parts of speech. If the word I'm working with at the moment is a part of speech that is not in this list -> I delete it.
stop_words - just a list of words I delete.
splitter.split_compound(word)[0] - returns a tuple with the most likely division of the compound word (I use it to divide long German words into shorter and more widely used ones). Here is the link to the repository with this functionality.
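Just to illustrate the format it returns (the score here is made up, but the shape is a list of (score, first_part, second_part) tuples, best split first):

from charsplit import Splitter

splitter = Splitter()
print(splitter.split_compound('Autobahnraststätte')[0])
# -> something like (0.7, 'Autobahn', 'Raststätte'), which is why I unpack it as _, word1, word2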
To sum up: I find the lemma of the word, make it lower case, delete stop words and some parts of speech, divide compound words, delete punctuation. I then join all the words and return an array of normalized lines.
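A minimal example of how I call it (the sentence is just an illustration; the exact output depends on the stop word list and the compound splitting):

sample = ['Die Bundesregierung diskutiert über die Wirtschaftspolitik.']
print(normalizer(sample))
# -> a list with one string of lowercased lemmas, e.g. ['bundesregierung diskutieren wirtschaftspolitik']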
2) Training the model
I train my model using de_core_news_sm (to make it possible in the future to use this model not only for classification but also for normalization). Here is the code for training:
from random import shuffle

import spacy

nlp = spacy.load('de_core_news_sm')
textcat = nlp.create_pipe('textcat', config={"exclusive_classes": False, "architecture": 'simple_cnn'})
nlp.add_pipe(textcat, last=True)
for category in categories:
    textcat.add_label(category)

pipe_exceptions = ["textcat"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes):  # only train the textcat component
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        shuffle(data)
        batches = spacy.util.minibatch(data)
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.25)
Some explanations of the above code:
data - a list of lists, where each list includes a line of text and a dictionary with categories (just like in the docs)
categories - a list of categories
n_iter - the number of iterations for training
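For clarity, data looks roughly like this (the texts and values are just illustrative; in reality each dictionary has an entry for all 21 categories and there are almost 700 000 pairs):

categories = ['POLITICS', 'ECONOMY', 'SPORT']  # in reality all 21 categories
n_iter = 20  # just an example value

data = [
    ['bundesregierung diskutieren wirtschaftspolitik',
     {'cats': {'POLITICS': 1.0, 'ECONOMY': 0.0, 'SPORT': 0.0}}],
    ['bayern münchen gewinnen spiel',
     {'cats': {'POLITICS': 0.0, 'ECONOMY': 0.0, 'SPORT': 1.0}}],
]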
3) At the end I just save the model with the to_disk method.
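That is, nothing more than the following (the path is a placeholder):

output_dir = 'models/news_textcat'  # placeholder path
nlp.to_disk(output_dir)

# later, to use the trained model:
nlp = spacy.load(output_dir)
doc = nlp('bayern münchen gewinnen spiel')
print(doc.cats)  # score per category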
With the above code I managed to train a model with 70% accuracy. Here is a list of what I've tried so far to improve this score:
1) Using another architecture (ensemble) - didn't give any improvements
2) Training on non-normalized data - the result was much worse
3) Using a pretrained BERT model - couldn't do it (here is my unanswered question about it)
4) Training de_core_news_md instead of de_core_news_sm - didn't give any improvements (I tried it because, according to the docs, there could be an improvement thanks to the vectors, if I understood it correctly. Correct me if I'm wrong)
5) Training on data normalized in a slightly different way (without lowercasing and punctuation deletion) - didn't give any improvements
6) Changing dropout - didn't help
So right now I am a little stuck on what to do next. I would be very grateful for any hint or advice.
Thanks in advance for your help!
The first thing I would suggest is increasing your batch size. After that, look at your optimizer (Adam if possible) and the learning rate, for which I don't see the code here. Finally, you can try changing your dropout.
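For example, something along these lines for the training loop (spaCy v2; the numbers are just starting points to experiment with, and if I remember correctly the learning rate on the default v2 optimizer is the alpha attribute):

from random import shuffle
from spacy.util import minibatch, compounding

optimizer = nlp.begin_training()  # in spaCy v2 this already returns an Adam optimizer
# optimizer.alpha = 0.001         # learning rate (attribute name on the v2/Thinc optimizer, if I recall correctly)

for i in range(n_iter):
    shuffle(data)
    # grow the batch size from 4 up to 64 instead of using the default fixed size
    batches = minibatch(data, size=compounding(4.0, 64.0, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.4)  # and try different dropout values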
Also, if you are trying neural networks and plan on changing a lot, it would be better to switch to PyTorch or TensorFlow. With PyTorch you can use the HuggingFace library, which has BERT built in.
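If you go that route, the starting point with a recent version of the transformers library looks roughly like this (the model name and the 21 labels come from your setup, the rest is standard boilerplate; you would still need to write the fine-tuning loop or use the Trainer class):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 'bert-base-german-cased' is one of the pretrained German BERT models on the model hub
tokenizer = AutoTokenizer.from_pretrained('bert-base-german-cased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-german-cased', num_labels=21)

inputs = tokenizer('Die Bundesregierung diskutiert über die Wirtschaftspolitik.',
                   return_tensors='pt', truncation=True)
logits = model(**inputs)[0]
print(logits.shape)  # torch.Size([1, 21]) - one raw score per category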
Hope this helps you!