I'm using spaCy for some downstream tasks, mainly noun phrase extraction. My texts contain a lot of parentheses, and when I look at the lemmas, I noticed that all punctuation that doesn't end a sentence is lemmatized to --:
import spacy
nlp = spacy.load("de_core_news_sm")
doc = nlp("(Das ist ein Test!)")
for token in doc:
    print(f"Text: '{token.text}', Lemma: '{token.lemma_}'")
Output:
Text: '(', Lemma: '--'
Text: 'Das', Lemma: 'der'
Text: 'ist', Lemma: 'sein'
Text: 'ein', Lemma: 'ein'
Text: 'Test', Lemma: 'Test'
Text: '!', Lemma: '--'
Text: ')', Lemma: '--'
Is that normal, and if so, why? What can I do to keep the parentheses?
I'm on spaCy 3.7.4 with Python 3.11.
I can confirm the issue with German, but when I try the equivalent sentence in Dutch, the ( and ) are kept as lemmas instead of --. So this is something particular to the German model.
You can override the default lemmata if you want:
import spacy
nlp = spacy.load("de_core_news_sm")
nlp.get_pipe("attribute_ruler").add([[{"TEXT": "("}]], {"LEMMA": "("})
nlp.get_pipe("attribute_ruler").add([[{"TEXT": ")"}]], {"LEMMA": ")"})
doc = nlp("(Das ist ein Test!)")
print(doc.text)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)
Result:
(Das ist ein Test!)
( ( PUNCT punct
Das der PRON sb
ist sein AUX ROOT
ein ein DET nk
Test Test NOUN pd
! -- PUNCT punct
) ) PUNCT punct
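Note that ! in the output above still lemmatizes to --, so if you want to keep other punctuation as well, you can register one rule per character. A minimal sketch of that idea, using a blank German pipeline so it runs without downloading de_core_news_sm (the punctuation list here is just an example, extend it as needed):

```python
import spacy

# Blank German pipeline: tokenization only, no trained lemmatizer.
nlp = spacy.blank("de")
ruler = nlp.add_pipe("attribute_ruler")

# Pin each punctuation mark's lemma to its own text.
for punct in ["(", ")", "!", "?", ",", ";", ":"]:
    ruler.add([[{"TEXT": punct}]], {"LEMMA": punct})

doc = nlp("(Das ist ein Test!)")
print([(t.text, t.lemma_) for t in doc])
```

In a full pipeline like de_core_news_sm, the attribute_ruler runs after the statistical components, so these rules override whatever lemma the model assigned, exactly as in the two add() calls above.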