python, nlp, spacy

SpaCy lemmatizer removes capitalization


I would like to lemmatize some Hungarian text and have run into a strange feature in spaCy. The token.lemma_ attribute lemmatizes well, but it returns some sentences without the capital first letter. This is a problem because my next step, unnest_sentences (in R), relies on initial capital letters to identify sentences and split the text into them.

At first I thought the problem was that I was using the latest version of spaCy, because I got this warning:

UserWarning: [W031] Model 'hu_core_ud_lg' (0.3.1) requires spaCy v2.1 and is incompatible with the current spaCy version (2.3.2). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version.

So I went ahead and installed spaCy 2.1, but the problem persists.

My data come from email messages that I cannot share here, but here is a small, artificial example:

# pip install -U spacy==2.1 # takes  9 mins
# pip install hu_core_ud_lg # takes 50 mins

import spacy
from spacy.lemmatizer import Lemmatizer
import hu_core_ud_lg
import pandas as pd
nlp = hu_core_ud_lg.load()

a = "Tisztelt levélíró!"
b = "Köszönettel vettük megkeresését."
df = pd.DataFrame({'text':[a, b]})

output_lemma = []

for i in df.text:
    mondat = ""
    doc = nlp(i)    
    for token in doc:
        mondat = mondat + " " + token.lemma_
    output_lemma.append(mondat)

output_lemma

which yields

[' tisztelt levélíró !', ' köszönet vesz megkeresés .']

but I would expect

[' Tisztelt levélíró !', ' Köszönet vesz megkeresés .']

When I pass my original data to the function, it returns some sentences with uppercase first letters and others with lowercase first letters. For some reason I couldn't reproduce that mixed pattern in the example above, but the main point should be clear: the function does not work as I expected.

Any ideas how I could fix this?

I'm using Jupyter Notebook, Python 2.7, Win 7 and a Toshiba laptop (Portégé Z830-10R i3-2367M).


Solution

  • Lowercasing is the expected behavior of spaCy's lemmatizer for non-proper-noun tokens.

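    One workaround is to check whether each token is title-cased (tok.is_title) and, if so, re-capitalize its lemma after lemmatizing. This restores only the initial capital; note that str.title() also lowercases the remaining characters of the lemma.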

    import spacy
    
    nlp = spacy.load('en_core_web_sm')
    
    text = 'This is a test sentence.'
    doc = nlp(text)
    newtext = ' '.join([tok.lemma_.title() if tok.is_title else tok.lemma_ for tok in doc])
    print(newtext)
    # This be a test sentence .
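
A variation on the same idea, as a minimal sketch: instead of str.title(), uppercase only the first character of the lemma and leave the rest untouched. This avoids changing lemmas such as "-PRON-" (which str.title() would turn into "-Pron-"). The helper names restore_initial_case and join_lemmas are hypothetical, not part of spaCy, and the (text, lemma) pairs below are written by hand to mirror the question's output so the snippet runs without a model. The check here is "the original token starts with an uppercase letter", which is looser than tok.is_title.

```python
def restore_initial_case(text, lemma):
    """Return the lemma with its first character uppercased
    if the original token text starts with an uppercase letter."""
    if lemma and text[:1].isupper():
        return lemma[0].upper() + lemma[1:]
    return lemma


def join_lemmas(pairs):
    """Join lemmas with spaces, restoring initial capitals.

    pairs: iterable of (token text, lemma) tuples.
    """
    return " ".join(restore_initial_case(text, lemma) for text, lemma in pairs)


# Hand-written pairs standing in for a parsed Doc (assumption: these are
# the lemmas shown in the question's output).
pairs = [
    ("Köszönettel", "köszönet"),
    ("vettük", "vesz"),
    ("megkeresését", "megkeresés"),
    (".", "."),
]
print(join_lemmas(pairs))
```

With a real pipeline, the pairs could be built as ((tok.text, tok.lemma_) for tok in doc), so the loop in the question would reduce to one join_lemmas call per row.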