Tags: python, nlp, wordnet, spacy, lemmatization

How does the spaCy lemmatizer work?


For lemmatization, spaCy has lists of words (adjectives, adverbs, verbs, ...) as well as lists of exceptions (adverbs_irreg, ...). For the regular forms there is a set of rules.

Let's take the word "wider" as an example.

As it is an adjective, the rule for lemmatization should be taken from this list:

ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"]
] 

As I understand the process will be like this:

1) Get the POS tag of the word to know whether it is a noun, a verb...
2) If the word is in the list of irregular cases, it is replaced directly; if not, one of the rules is applied.

Now, how is it decided to use "er" → "e" instead of "er" → "" to get "wide" and not "wid"?
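
A minimal sketch of naively applying the adjective rules to "wider" shows the ambiguity: both "er" rules match, so two candidate lemmas are produced.

```python
# Suffix rules for adjectives, as listed above.
ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"],
]

word = "wider"
# Strip each matching suffix and append its replacement.
candidates = [word[:len(word) - len(old)] + new
              for old, new in ADJECTIVE_RULES
              if word.endswith(old)]
print(candidates)  # ['wid', 'wide']
```

Something beyond the rules themselves has to pick "wide" over "wid".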

Here it can be tested.


Solution

  • Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

    Class

    It starts off by initializing 3 attributes:

    class Lemmatizer(object):
        @classmethod
        def load(cls, path, index=None, exc=None, rules=None):
            return cls(index or {}, exc or {}, rules or {})
    
        def __init__(self, index, exceptions, rules):
            self.index = index
            self.exc = exceptions
            self.rules = rules
    

    Now, looking at self.exc for English, we see that it points to https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/init.py where it loads files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer

    Why doesn't spaCy just read a file?

    Most probably because declaring the strings in-code is faster than streaming them through I/O.


    Where do the index, exceptions and rules come from?

    Looking at them closely, they all seem to come from the original Princeton WordNet: https://wordnet.princeton.edu/man/wndb.5WN.html

    Rules

    Looking even closer, the rules in https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_lemma_rules.py are similar to the _morphy rules from nltk: https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749

    And these rules originally come from the Morphy software: https://wordnet.princeton.edu/man/morphy.7WN.html

    Additionally, spaCy has included some punctuation rules that aren't from Princeton Morphy:

    PUNCT_RULES = [
        ["“", "\""],
        ["”", "\""],
        ["\u2018", "'"],
        ["\u2019", "'"]
    ]
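
    As a quick sketch (a plain replace loop, not spaCy's actual rule machinery), these pairs simply map curly quotes to their ASCII equivalents:

```python
PUNCT_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"],
]

def normalize_quotes(text):
    # Sketch only: apply each [old, new] pair as a plain replacement.
    for old, new in PUNCT_RULES:
        text = text.replace(old, new)
    return text

print(normalize_quotes("“wider”"))  # "wider"
```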
    

    Exceptions

    As for the exceptions, they are stored in the *_irreg.py files in spaCy, and they look like they also come from Princeton WordNet.

    This is evident if we look at a mirror of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc) and download the wordnet package from nltk; we see that it's the same list:

    alvas@ubi:~/nltk_data/corpora/wordnet$ ls
    adj.exc       cntlist.rev  data.noun  index.adv    index.verb  noun.exc
    adv.exc       data.adj     data.verb  index.noun   lexnames    README
    citation.bib  data.adv     index.adj  index.sense  LICENSE     verb.exc
    alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc 
    1490 adj.exc
    

    Index

    If we look at the spaCy lemmatizer's index, we see that it also comes from WordNet, e.g. https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_adjectives.py and the redistributed copy of WordNet in nltk:

    alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj 
    
      1 This software and database is being provided to you, the LICENSEE, by  
      2 Princeton University under the following license.  By obtaining, using  
      3 and/or copying this software and database, you agree that you have  
      4 read, understood, and will comply with these terms and conditions.:  
      5   
      6 Permission to use, copy, modify and distribute this software and  
      7 database and its documentation for any purpose and without fee or  
      8 royalty is hereby granted, provided that you agree to comply with  
      9 the following copyright notice and statements, including the disclaimer,  
      10 and that the same appear on ALL copies of the software, database and  
      11 documentation, including modifications that you make for internal  
      12 use or for distribution.  
      13   
      14 WordNet 3.0 Copyright 2006 by Princeton University.  All rights reserved.  
      15   
      16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON  
      17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR  
      18 IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON  
      19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-  
      20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE  
      21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT  
      22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR  
      23 OTHER RIGHTS.  
      24   
      25 The name of Princeton University or Princeton may not be used in  
      26 advertising or publicity pertaining to distribution of the software  
      27 and/or database.  Title to copyright in this software, database and  
      28 any associated documentation shall at all times remain with  
      29 Princeton University and LICENSEE agrees to preserve same.  
    00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"  
    00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"  
    00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"  
    00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"  
    00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex  
    00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base  
    00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part  
    00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part  
    00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 |  being born or beginning; "the nascent chicks"; "a nascent insurgency"   
    00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"  
    00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels  
    

    Given that the dictionary, exceptions and rules that the spaCy lemmatizer uses largely come from Princeton WordNet and its Morphy software, we can move on to see how spaCy actually applies the rules using the index and exceptions.

    We go back to the https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

    The main action comes from a module-level function rather than from the Lemmatizer class:

    def lemmatize(string, index, exceptions, rules):
        string = string.lower()
        forms = []
        # TODO: Is this correct? See discussion in Issue #435.
        #if string in index:
        #    forms.append(string)
        forms.extend(exceptions.get(string, []))
        oov_forms = []
        for old, new in rules:
            if string.endswith(old):
                form = string[:len(string) - len(old)] + new
                if not form:
                    pass
                elif form in index or not form.isalpha():
                    forms.append(form)
                else:
                    oov_forms.append(form)
        if not forms:
            forms.extend(oov_forms)
        if not forms:
            forms.append(string)
        return set(forms)
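
    This also answers the original question about "wider": with an index that contains "wide" but not "wid", the rule "er" → "" yields an out-of-vocabulary form while "er" → "e" yields an in-index form, so only "wide" survives. A runnable sketch (the function repeated so it stands alone; the toy index is an assumption, not spaCy's real data):

```python
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

ADJECTIVE_RULES = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]
toy_index = {"wide"}  # "wide" is a known adjective; "wid" is not
print(lemmatize("wider", toy_index, {}, ADJECTIVE_RULES))  # {'wide'}
```

    "wid" is generated too, but it lands in oov_forms and is discarded because an in-index form was found.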
    

    Why is the lemmatize method outside of the Lemmatizer class?

    I'm not exactly sure, but perhaps it's to ensure that the lemmatization function can be called outside of a class instance. Given that @staticmethod and @classmethod exist, there may be other considerations as to why the function and the class have been decoupled.

    Morphy vs Spacy

    Comparing the spaCy lemmatize() function against the morphy() function in nltk (which originally comes from http://blog.osteele.com/2004/04/pywordnet-20/, created more than a decade ago), the main processes in Oliver Steele's Python port of the WordNet morphy are:

    1. Check the exception lists
    2. Apply rules once to the input to get y1, y2, y3, etc.
    3. Return all that are in the database (and check the original too)
    4. If there are no matches, keep applying rules until we find a match
    5. Return an empty list if we can't find anything
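
    Step 4 ("keep applying rules until we find a match") can be sketched as a breadth-first search over candidate forms (a hypothetical helper for illustration, not nltk's actual code):

```python
def morphy_sketch(word, database, substitutions):
    # Hypothetical sketch of morphy's iterative rule application
    # (the exception-list handling from step 1 is omitted for brevity).
    forms = [word]
    seen = {word}
    while forms:
        # Steps 2-3: return every candidate that is in the database
        # (the original word is checked on the first pass).
        matches = [f for f in forms if f in database]
        if matches:
            return matches
        # Step 4: apply the substitution rules again to all candidates.
        next_forms = []
        for f in forms:
            for old, new in substitutions:
                if f.endswith(old):
                    candidate = f[:len(f) - len(old)] + new
                    if candidate and candidate not in seen:
                        seen.add(candidate)
                        next_forms.append(candidate)
        forms = next_forms
    return []  # Step 5: nothing found

print(morphy_sketch("churches", {"church"}, [("es", ""), ("s", "")]))
# ['church']
```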

    For spaCy, it's possibly still under development, given the TODO at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76

    But the general process seems to be:

    1. Look up the exceptions; take the lemmas from the exception list if the word is in it.
    2. Apply the rules.
    3. Save the resulting forms that are in the index lists.
    4. If there are no lemmas from steps 1-3, fall back to the out-of-vocabulary (OOV) forms, and failing that, the original string.
    5. Return the lemma forms.

    In terms of OOV handling, spaCy returns the original string if no lemmatized form is found. In that respect, the nltk implementation of morphy does the same, e.g.

    >>> from nltk.stem import WordNetLemmatizer
    >>> wnl = WordNetLemmatizer()
    >>> wnl.lemmatize('alvations')
    'alvations'
    

    Checking for infinitive before lemmatization

    Possibly another point of difference is how morphy and spaCy decide what POS to assign to the word. In that respect, spaCy puts some linguistic rules in the Lemmatizer() to decide whether a word is the base form, and skips lemmatization entirely if the word is already in the infinitive form (is_base_form()). This can save quite a bit of time if lemmatization is done for all words in a corpus and quite a chunk of them are infinitives (already the lemma form).

    But that's possible in spaCy because it allows the lemmatizer to access the POS tag, which is tied closely to some morphological rules. For morphy, although it's possible to figure out some morphology using the fine-grained PTB POS tags, it still takes some effort to sort them out to know which forms are infinitives.
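
    As a sketch of the idea (with hypothetical feature names, not spaCy's actual is_base_form() implementation), such a check reads the morphological features tied to the POS tag and short-circuits lemmatization:

```python
def is_base_form_sketch(pos, morphology):
    # Hypothetical check: skip lemmatization when the morphology
    # already marks the word as its own base form.
    if pos == "VERB" and morphology.get("VerbForm") == "Inf":
        return True
    if pos == "NOUN" and morphology.get("Number") == "Sing":
        return True
    if pos == "ADJ" and morphology.get("Degree") == "Pos":
        return True
    return False

print(is_base_form_sketch("VERB", {"VerbForm": "Inf"}))  # True
print(is_base_form_sketch("ADJ", {"Degree": "Cmp"}))     # False ("wider")
```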

    Generally, the 3 primary signals of morphological features need to be teased out of the POS tag:

    • person
    • number
    • gender

    Updated

    spaCy made changes to its lemmatizer after the initial answer (12 May 17). I think the purpose was to make lemmatization faster by avoiding the index look-ups and rule processing.

    So they pre-lemmatized words and left them in a lookup hash table to make retrieval O(1) for words they have pre-lemmatized: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/lemmatizer/lookup.py
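
    The idea can be sketched with a plain dict (the entries below are illustrative assumptions, not the real table):

```python
# Toy excerpt of a pre-computed word -> lemma lookup table.
LOOKUP = {
    "wider": "wide",
    "widest": "wide",
    "was": "be",
}

def lookup_lemmatize(word):
    # O(1) retrieval; fall back to the original string for OOV words.
    return LOOKUP.get(word.lower(), word)

print(lookup_lemmatize("wider"))      # wide
print(lookup_lemmatize("alvations"))  # alvations
```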

    Also, in an effort to unify the lemmatizers across languages, the lemmatizer is now located at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L92

    But the underlying lemmatization steps discussed above are still relevant to the current spaCy version (4d2d7d586608ddc0bcb2857fb3c2d0d4c151ebfc).


    Epilogue

    I guess now that we know it works with linguistic rules and all, the other question is "are there any non-rule-based methods for lemmatization?"

    But before answering that question, "What exactly is a lemma?" might be the better question to ask.