python nlp spacy

Problems with Named Entity Recognition in spaCy using German de_dep_news_trf Pipeline


I'm currently working on a project using spaCy with the German trained pipeline de_dep_news_trf.

Unfortunately, I'm having issues with named entity recognition (NER).

When I run a simple sentence like "Berlin ist die Hauptstadt von Deutschland. Angela Merkel war die Bundeskanzlerin.", no entities are detected.

I followed these steps to set up my Python 3.12 environment on Windows, in a PyCharm Community project:

python.exe -m pip install --upgrade pip
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download de_dep_news_trf --timeout 600
pip install spacy[transformers]

Here is a snippet of my code:

import spacy


def process_text_with_spacy(text_to_process):
    doc = nlp(text_to_process)
    data = {
        "text": text_to_process,
        "sentences": []
    }
    for sent in doc.sents:
        process_sentence_data = {
            "sentence": sent.text,
            "entities": []
        }
        for ent in sent.ents:
            process_sentence_data["entities"].append({
                "text": ent.text,
                "start": ent.start_char,
                "end": ent.end_char,
                "label": ent.label_
            })
        data["sentences"].append(process_sentence_data)
    return data


nlp = spacy.load('de_dep_news_trf')

sample_text = "Berlin ist die Hauptstadt von Deutschland. Angela Merkel war die Bundeskanzlerin."

processed_data = process_text_with_spacy(sample_text)

print("Text:", sample_text)
for sentence_data in processed_data["sentences"]:
    print("Sentence:", sentence_data["sentence"])
    print("Entities:", sentence_data["entities"])

Output:

Text: Berlin ist die Hauptstadt von Deutschland. Angela Merkel war die Bundeskanzlerin.
Sentence: Berlin ist die Hauptstadt von Deutschland.
Entities: []
Sentence: Angela Merkel war die Bundeskanzlerin.
Entities: []

When using de_core_news_lg, the output for each sentence is:

Text: Berlin ist die Hauptstadt von Deutschland. Angela Merkel war die Bundeskanzlerin.
Sentence: Berlin ist die Hauptstadt von Deutschland.
Entities: [{'text': 'Berlin', 'start': 0, 'end': 6, 'label': 'LOC'}, {'text': 'Deutschland', 'start': 30, 'end': 41, 'label': 'LOC'}]
Sentence: Angela Merkel war die Bundeskanzlerin.
Entities: [{'text': 'Angela Merkel', 'start': 43, 'end': 56, 'label': 'PER'}]

However, with de_dep_news_trf the entity lists are empty. I chose de_dep_news_trf because the spaCy website recommends it when optimizing for accuracy.

Could someone explain why de_dep_news_trf does not return the same result? Is there a specific reason or setting that could cause this difference?

Thank you for your help!


Solution

  • The problem is that this pipeline has no component for recognizing entities.

    See the documentation for de_dep_news_trf: its components are transformer, tagger, morphologizer, parser, lemmatizer, and attribute_ruler, but there is no ner component (EntityRecognizer), so doc.ents stays empty.

    You will need to use one of the other German pipelines, which do include ner:

    • de_core_news_sm
    • de_core_news_md
    • de_core_news_lg
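
The check described above can be sketched in code. This is a minimal sketch, not a definitive implementation: has_ner and load_pipeline_with_ner are hypothetical helper names, the component list is copied from the answer, and the loader assumes spaCy v3 with the candidate packages already installed.

```python
# Component names listed for de_dep_news_trf in the answer above.
DEP_NEWS_TRF_COMPONENTS = [
    "transformer", "tagger", "morphologizer",
    "parser", "lemmatizer", "attribute_ruler",
]


def has_ner(pipe_names):
    """Return True if the component list includes an entity recognizer."""
    return "ner" in pipe_names


def load_pipeline_with_ner(candidates=("de_dep_news_trf", "de_core_news_lg")):
    """Load the first candidate pipeline that has an 'ner' component.

    Hypothetical helper: assumes spaCy v3 and that the candidate
    packages are installed (spacy.load raises OSError otherwise).
    """
    import spacy  # imported here so has_ner() works without spaCy installed

    for name in candidates:
        nlp = spacy.load(name)
        # nlp.pipe_names lists the active components of the loaded pipeline.
        if has_ner(nlp.pipe_names):
            return nlp
    raise ValueError(f"None of {candidates} has an 'ner' component")


print(has_ner(DEP_NEWS_TRF_COMPONENTS))  # False
```

In the question's script, replacing spacy.load('de_dep_news_trf') with load_pipeline_with_ner() should skip the transformer pipeline and return de_core_news_lg, assuming both are installed. Loading every candidate just to inspect it is slow for transformer models, so in practice it is probably simpler to print nlp.pipe_names once and then load the right pipeline directly.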