Tags: python, spacy, lemmatization

Lemma of punctuation in spaCy


I'm using spaCy for some downstream tasks, mainly noun phrase extraction. My texts contain a lot of parentheses, and when looking at the lemmas, I noticed that all punctuation that doesn't end a sentence gets the lemma --:

import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("(Das ist ein Test!)")
for token in doc:
    print(f"Text: '{token.text}', Lemma: '{token.lemma_}'")

Output:

Text: '(', Lemma: '--'
Text: 'Das', Lemma: 'der'
Text: 'ist', Lemma: 'sein'
Text: 'ein', Lemma: 'ein'
Text: 'Test', Lemma: 'Test'
Text: '!', Lemma: '--'
Text: ')', Lemma: '--'

Is that normal and, if so, why? And what can I do to keep the parentheses?

I'm on spaCy 3.7.4 with Python 3.11.


Solution

  • I can confirm the issue with German, but when I try the equivalent sentence in Dutch, ( and ) are kept as their own lemmas instead of --. So this seems to be specific to the German model.

    You can override the default lemmata if you want:

    import spacy

    nlp = spacy.load("de_core_news_sm")

    # Add rules that set the lemma of "(" and ")" to the token text itself.
    ruler = nlp.get_pipe("attribute_ruler")
    ruler.add([[{"TEXT": "("}]], {"LEMMA": "("})
    ruler.add([[{"TEXT": ")"}]], {"LEMMA": ")"})
    
    doc = nlp("(Das ist ein Test!)")
    print(doc.text)
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.dep_)
    

    Result:

    (Das ist ein Test!)
    ( ( PUNCT punct
    Das der PRON sb
    ist sein AUX ROOT
    ein ein DET nk
    Test Test NOUN pd
    ! -- PUNCT punct
    ) ) PUNCT punct
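
If other punctuation such as ! should keep its surface form too, one option is a small custom component added at the end of the pipeline that copies each punctuation token's text into its lemma. The following is only a minimal sketch, not a built-in spaCy feature: the component name punct_lemma_as_text is made up for illustration, and spacy.blank("de") stands in for the trained pipeline so that the snippet runs without a model download.

```python
import spacy
from spacy.language import Language


# Hypothetical component name, chosen for this example only.
@Language.component("punct_lemma_as_text")
def punct_lemma_as_text(doc):
    # Overwrite the lemma of every punctuation token with its own text.
    for token in doc:
        if token.is_punct:
            token.lemma_ = token.text
    return doc


# Assumption: a blank German pipeline is used here so the example is
# self-contained; swap in spacy.load("de_core_news_sm") for real use.
nlp = spacy.blank("de")

# add_pipe appends to the end of the pipeline by default, so in a trained
# pipeline this component would run after the lemmatizer.
nlp.add_pipe("punct_lemma_as_text")

doc = nlp("(Das ist ein Test!)")
punct_lemmas = [(token.text, token.lemma_) for token in doc if token.is_punct]
print(punct_lemmas)
```

Compared with the attribute ruler rules above, this avoids listing every punctuation character by hand, at the cost of an extra pipeline component. Non-punctuation tokens are left untouched, so their lemmas still come from whatever lemmatizer the pipeline provides.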