python, nlp, wordnet, spacy, lemmatization

How to get better lemmas from Spacy


"PM" can mean "pm" (the time of day), but it can also mean "Prime Minister".

I want to capture the latter: the lemma of "PM" should be "Prime Minister". How can I do this using spaCy?

Example returning unexpected lemma:

>>> import spacy
>>> #nlp = spacy.load('en')
>>> nlp = spacy.load('en_core_web_lg')
>>> doc = nlp(u'PM means prime minister')
>>> for word in doc:
...     print(word.text, word.lemma_)
... 
PM pm
means mean
prime prime
minister minister

According to the docs at https://spacy.io/api/annotation, spaCy uses WordNet for lemmas:

A lemma is the uninflected form of a word. The English lemmatization data is taken from WordNet.

When I looked up "pm" in WordNet, it showed "Prime Minister" as one of the lemmas.

What am I missing here?


Solution

  • I think it would help to answer your question by first clarifying some common NLP tasks.

    Lemmatization is the process of finding the canonical form of a word given its different inflections. For example, "run", "runs", "ran" and "running" are forms of the same lexeme: "run". If you were to lemmatize "run", "runs", and "ran", the output would be "run" each time. In your example sentence, note how spaCy lemmatizes "means" to "mean".

    Given that, it doesn't sound like the task you want to perform is lemmatization. It may help to solidify this idea with a silly counterexample: what are the different inflections of a hypothetical lemma "pm": pming, pmed, pms? None of those are actual words.
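To make the distinction concrete, here is a minimal sketch in plain Python. It is not how spaCy lemmatizes internally; both tables and both function names (`lemmatize`, `expand`) are made up for illustration. The first table maps inflected forms to a lemma, while the second, separate table maps an abbreviation to its expansion, which is a different kind of lookup.

```python
# Toy lemma table (illustrative data only): inflected form -> lemma.
LEMMA_TABLE = {
    "run": "run",
    "runs": "run",
    "ran": "run",
    "running": "run",
    "means": "mean",
}

# Separate, hypothetical table: abbreviation -> expansion (not a lemma).
ABBREVIATIONS = {"PM": "Prime Minister"}


def lemmatize(word):
    """Look up the lemma in the toy table; fall back to the lowercased word."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


def expand(word):
    """Expand a known abbreviation (case-sensitive); otherwise return the word."""
    return ABBREVIATIONS.get(word, word)


for word in ["runs", "ran", "running", "means", "PM"]:
    print(word, "| lemma:", lemmatize(word), "| expansion:", expand(word))
```

In this sketch "PM" has no inflected forms to collapse, so the lemmatizer can only lowercase it; getting "Prime Minister" out requires the abbreviation table.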

    It sounds like your task may be closer to Named Entity Recognition (NER), which you could also do in spaCy. To iterate through the detected entities in a parsed document, you can use the .ents attribute, as follows:

    >>> for ent in doc.ents:
    ...     print(ent, ent.label_)
    

    With the sentence you've given, spaCy (v2.0.5) doesn't detect any entities. If you replace "PM" with "P.M.", it will detect that as an entity, but label it GPE (geopolitical entity), which is not what you want.

    The best approach depends on your task. If you want "PM" to carry a classification of your choosing, I'd look at setting the entity annotations yourself. If you want to pull out every mention of "PM" from a large corpus of documents, use spaCy's rule-based Matcher in a pipeline.
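A minimal sketch combining both suggestions: a Matcher rule finds "PM" and the hits are written into the entity annotations. This assumes a recent spaCy (the list-of-patterns `matcher.add` call is the v3 style, which differs from v2.0.5) and uses a blank English pipeline so no model download is needed. The label "TITLE", the helper name `tag_pm`, and the "skip it after a number" filter are my own assumptions for illustration, not anything spaCy prescribes.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Tokenizer-only English pipeline: no statistical model is loaded.
nlp = spacy.blank("en")

# Match the exact, case-sensitive token "PM".
matcher = Matcher(nlp.vocab)
matcher.add("PM_ABBREVIATION", [[{"ORTH": "PM"}]])  # v3-style call (assumption)


def tag_pm(text, label="TITLE"):
    """Return a Doc whose entities are the "PM" tokens that do not follow a number.

    The label "TITLE" is a hypothetical name chosen for this sketch.
    """
    doc = nlp(text)
    spans = []
    for _match_id, start, end in matcher(doc):
        # Crude disambiguation (assumption): "5 PM" is a time, not a person.
        if start > 0 and doc[start - 1].like_num:
            continue
        spans.append(Span(doc, start, end, label=label))
    doc.ents = spans  # set the entity annotations by hand
    return doc


doc = tag_pm("The PM will speak at 5 PM today")
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start)
```

A rule like this is only as good as its filter. On a real corpus you would probably want more context, such as surrounding words or a trained NER model, before deciding that a given "PM" means "Prime Minister".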