python nlp spacy lemmatization named-entity-recognition

How to perform NER on true case, then lemmatization on lower case, with spaCy


I am trying to lemmatize a text using spaCy 2.0.12 with the French model fr_core_news_sm. Moreover, I want to replace people's names with an arbitrary sequence of characters, detecting such names using token.ent_type_ == 'PER'. An example outcome would be "Pierre aime les chiens" -> "~PER~ aimer chien".

The problem is I can't find a way to do both. I only have these two partial options:

  • I can feed the pipeline with the original text: doc = nlp(text). The NER then recognizes most people's names, but the lemmas of words starting with a capital letter are not correct. For example, the lemmas of the simple question "Pouvons-nous faire ça?" would be ['Pouvons', '-', 'se', 'faire', 'ça', '?'], where "Pouvons" is still an inflected form.
  • I can feed the pipeline with the lower-cased text: doc = nlp(text.lower()). My previous example then correctly gives ['pouvoir', '-', 'se', 'faire', 'ça', '?'], but most people's names are no longer recognized as entities by the NER, presumably because an initial capital is a useful indicator for finding entities.

My idea would be to run the standard pipeline (tagger, parser, NER) on the original text, then lowercase, and lemmatize only at the end.

However, lemmatization doesn't seem to have its own pipeline component, and the documentation doesn't explain how and where it is performed. This answer seems to imply that lemmatization is performed independently of any pipeline component, and possibly at different stages of the pipeline.

So my question is: how can I choose when lemmatization is performed and which input it is given?


Solution

  • If you can, use the most recent version of spaCy instead. The French lemmatizer has been improved a lot in 2.1.

    If you have to use 2.0, consider using an alternate lemmatizer like this one: https://spacy.io/universe/project/spacy-lefff
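If neither option is available, the two partial approaches from the question can be combined in a two-pass workaround: take the entity labels from the true-case text and the lemmas from the lower-cased text. The following is only a minimal sketch, not spaCy's own mechanism. The function name and its parameters (person_label, mask) are hypothetical, nlp is assumed to be a loaded pipeline such as spacy.load('fr_core_news_sm'), and it assumes that lower-casing does not change the tokenization, which it checks before pairing tokens by position.

```python
def lemmatize_with_masked_names(text, nlp, person_label="PER", mask="~PER~"):
    """Hypothetical helper: NER on true case, lemmas from lower case.

    `nlp` is any callable returning an iterable of tokens that expose
    `ent_type_` and `lemma_`, as spaCy tokens do.
    """
    cased_tokens = list(nlp(text))            # pass 1: used for entity labels
    lowered_tokens = list(nlp(text.lower()))  # pass 2: used for lemmas

    # Assumption: both passes tokenize identically, so tokens pair up by index.
    if len(cased_tokens) != len(lowered_tokens):
        raise ValueError("tokenization differs between the two passes")

    result = []
    for cased, lowered in zip(cased_tokens, lowered_tokens):
        if cased.ent_type_ == person_label:
            result.append(mask)            # hide the person name
        else:
            result.append(lowered.lemma_)  # lemma computed on lower case
    return result
```

Note that this runs the pipeline twice, so it roughly doubles the processing time, and a multi-token name yields one mask per token. The quality of the lemmas still depends on the lemmatizer of the installed version.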