I would like to lemmatize some Hungarian text and have run into a strange behavior in spaCy. The token.lemma_ attribute lemmatizes well, but it returns some of the sentences without the first letter capitalized. This is quite annoying, as my next step, unnest_sentences (in R), requires capitalized first letters in order to identify sentence boundaries and break the text down into individual sentences.
First I thought the problem was that I was using the latest version of spaCy, since I had gotten this warning:
UserWarning: [W031] Model 'hu_core_ud_lg' (0.3.1) requires spaCy v2.1 and is incompatible with the current spaCy version (2.3.2). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version.
So I went ahead and downgraded to spaCy 2.1, but the problem still persists.
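(For reference, a quick way to confirm which spaCy version the notebook kernel is actually running; the spacy validate CLI also reports whether installed models are compatible with it:)

import spacy
print(spacy.__version__)  # should print 2.1.x after the downgrade
# from a shell: python -m spacy validate
# lists installed models and their compatibility with the installed spaCy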
The source of my data is some email messages that I cannot share here, so here is a small, artificial example:
# pip install -U spacy==2.1 # takes 9 mins
# pip install hu_core_ud_lg # takes 50 mins
import spacy
from spacy.lemmatizer import Lemmatizer
import hu_core_ud_lg
import pandas as pd

nlp = hu_core_ud_lg.load()

a = "Tisztelt levélíró!"
b = "Köszönettel vettük megkeresését."
df = pd.DataFrame({'text': [a, b]})

output_lemma = []
for i in df.text:
    mondat = ""
    doc = nlp(i)
    for token in doc:
        mondat = mondat + " " + token.lemma_
    output_lemma.append(mondat)

output_lemma
which yields
[' tisztelt levélíró !', ' köszönet vesz megkeresés .']
but I would expect
[' Tisztelt levélíró !', ' Köszönet vesz megkeresés .']
When I pass my original data to the function, it returns some sentences with uppercase first letters and others with lowercase ones. For some strange reason I couldn't reproduce that pattern in the small example above, but I hope the main point is clear: the lemmatizer does not preserve capitalization as I expected.
Any ideas how I could fix this?
I'm using Jupyter Notebook, Python 2.7, Win 7 and a Toshiba laptop (Portégé Z830-10R i3-2367M).
Lowercasing is the expected behavior of spaCy's lemmatizer for non-proper-noun tokens.
One workaround is to check whether each token is titlecased and, after lemmatizing, restore the original casing (this only affects the first character):
import spacy
nlp = spacy.load('en_core_web_sm')
text = 'This is a test sentence.'
doc = nlp(text)
newtext = ' '.join([tok.lemma_.title() if tok.is_title else tok.lemma_ for tok in doc])
print(newtext)
# This be a test sentence .
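The same workaround applied to the Hungarian example from the question (a minimal sketch, assuming nlp and df are set up as in your snippet, and that hu_core_ud_lg produces the lemmas shown in your output):

output_lemma = []
for i in df.text:
    mondat = ""
    doc = nlp(i)
    for token in doc:
        # restore title-case on lemmas whose source token was title-cased
        lemma = token.lemma_.title() if token.is_title else token.lemma_
        mondat = mondat + " " + lemma
    output_lemma.append(mondat)

output_lemma
# [' Tisztelt levélíró !', ' Köszönet vesz megkeresés .']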