Tags: python, nlp, wordnet, spacy, lemmatization

How does the spaCy lemmatizer work?


For lemmatization, spaCy has lists of words (adjectives, adverbs, verbs, ...) as well as lists of exceptions (adverbs_irreg, ...). For the regular forms there is a set of rules.

Let's take the word "wider" as an example.

As it is an adjective, the rule for lemmatization should be taken from this list:

ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"]
] 

As I understand the process will be like this:

1) Get the POS tag of the word to know whether it is a noun, a verb...
2) If the word is in the list of irregular cases, it is replaced directly; if not, one of the rules is applied.

Now, how is it decided to use "er" → "e" instead of "er" → "" to get "wide" and not "wid"?
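
A minimal sketch of naively applying the adjective rules to "wider" shows the ambiguity: both "er" rules match, so two candidate lemmas are produced.

```python
# Suffix rules for adjectives, as listed above.
ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"],
]

word = "wider"
# Strip each matching suffix and append its replacement.
candidates = [word[:len(word) - len(old)] + new
              for old, new in ADJECTIVE_RULES
              if word.endswith(old)]
print(candidates)  # ['wid', 'wide']
```

Something beyond the rules themselves has to pick "wide" over "wid".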

Here it can be tested.


Solution

  • Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

    Class

    It starts off by initializing 3 attributes:

    class Lemmatizer(object):
        @classmethod
        def load(cls, path, index=None, exc=None, rules=None):
            return cls(index or {}, exc or {}, rules or {})
    
        def __init__(self, index, exceptions, rules):
            self.index = index
            self.exc = exceptions
            self.rules = rules
    

    Now, looking at self.exc for English, we see that it points to https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/init.py where it loads files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer

    Why doesn't spaCy just read a file?

    Most probably because declaring the strings in-code is faster than streaming them through I/O.


    Where do the index, exceptions and rules come from?

    Looking at them closely, they all seem to come from the original Princeton WordNet: https://wordnet.princeton.edu/man/wndb.5WN.html

    Rules

    Looking even closer, the rules in https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_lemma_rules.py are similar to the _morphy rules from nltk: https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749

    And these rules originally come from the Morphy software: https://wordnet.princeton.edu/man/morphy.7WN.html

    Additionally, spaCy has included some punctuation rules that aren't from Princeton Morphy:

    PUNCT_RULES = [
        ["“", "\""],
        ["”", "\""],
        ["\u2018", "'"],
        ["\u2019", "'"]
    ]
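
    As a quick sketch (a plain replace loop, not spaCy's actual rule machinery), these pairs simply map curly quotes to their ASCII equivalents:

```python
PUNCT_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"],
]

def normalize_quotes(text):
    # Sketch only: apply each [old, new] pair as a plain replacement.
    for old, new in PUNCT_RULES:
        text = text.replace(old, new)
    return text

print(normalize_quotes("“wider”"))  # "wider"
```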
    

    Exceptions

    As for the exceptions, they are stored in the *_irreg.py files in spaCy, and they look like they also come from Princeton WordNet.

    This is evident if we look at a mirror of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc) and download the wordnet package from nltk; we see that it's the same list:

    alvas@ubi:~/nltk_data/corpora/wordnet$ ls
    adj.exc       cntlist.rev  data.noun  index.adv    index.verb  noun.exc
    adv.exc       data.adj     data.verb  index.noun   lexnames    README
    citation.bib  data.adv     index.adj  index.sense  LICENSE     verb.exc
    alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc 
    1490 adj.exc
    

    Index

    If we look at the spaCy lemmatizer's index, we see that it also comes from WordNet, e.g. https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_adjectives.py and the redistributed copy of WordNet in nltk:

    alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj 
    
      1 This software and database is being provided to you, the LICENSEE, by  
      2 Princeton University under the following license.  By obtaining, using  
      3 and/or copying this software and database, you agree that you have  
      4 read, understood, and will comply with these terms and conditions.:  
      5   
      6 Permission to use, copy, modify and distribute this software and  
      7 database and its documentation for any purpose and without fee or  
      8 royalty is hereby granted, provided that you agree to comply with  
      9 the following copyright notice and statements, including the disclaimer,  
      10 and that the same appear on ALL copies of the software, database and  
      11 documentation, including modifications that you make for internal  
      12 use or for distribution.  
      13   
      14 WordNet 3.0 Copyright 2006 by Princeton University.  All rights reserved.  
      15   
      16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON  
      17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR  
      18 IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON  
      19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-  
      20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE  
      21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT  
      22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR  
      23 OTHER RIGHTS.  
      24   
      25 The name of Princeton University or Princeton may not be used in  
      26 advertising or publicity pertaining to distribution of the software  
      27 and/or database.  Title to copyright in this software, database and  
      28 any associated documentation shall at all times remain with  
      29 Princeton University and LICENSEE agrees to preserve same.  
    00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"  
    00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"  
    00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"  
    00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"  
    00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex  
    00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base  
    00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part  
    00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part  
    00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 |  being born or beginning; "the nascent chicks"; "a nascent insurgency"   
    00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"  
    00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels  
    

    Given that the dictionary, exceptions and rules that the spaCy lemmatizer uses largely come from Princeton WordNet and its Morphy software, we can move on to see how spaCy actually applies the rules using the index and exceptions.

    We go back to the https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

    The main action comes from a module-level function rather than from the Lemmatizer class:

    def lemmatize(string, index, exceptions, rules):
        string = string.lower()
        forms = []
        # TODO: Is this correct? See discussion in Issue #435.
        #if string in index:
        #    forms.append(string)
        forms.extend(exceptions.get(string, []))
        oov_forms = []
        for old, new in rules:
            if string.endswith(old):
                form = string[:len(string) - len(old)] + new
                if not form:
                    pass
                elif form in index or not form.isalpha():
                    forms.append(form)
                else:
                    oov_forms.append(form)
        if not forms:
            forms.extend(oov_forms)
        if not forms:
            forms.append(string)
        return set(forms)
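
    This also answers the original question about "wider": with an index that contains "wide" but not "wid", the rule "er" → "" yields an out-of-vocabulary form while "er" → "e" yields an in-index form, so only "wide" survives. A runnable sketch (the function repeated so it stands alone; the toy index is an assumption, not spaCy's real data):

```python
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

ADJECTIVE_RULES = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]
toy_index = {"wide"}  # "wide" is a known adjective; "wid" is not
print(lemmatize("wider", toy_index, {}, ADJECTIVE_RULES))  # {'wide'}
```

    "wid" is generated too, but it lands in oov_forms and is discarded because an in-index form was found.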
    

    Why is the lemmatize method outside of the Lemmatizer class?

    I'm not exactly sure, but perhaps it's to ensure that the lemmatization function can be called outside of a class instance. Given that @staticmethod and @classmethod exist, there may be other considerations as to why the function and the class have been decoupled.

    Morphy vs Spacy

    Comparing the spaCy lemmatize() function against the morphy() function in nltk (which originally comes from http://blog.osteele.com/2004/04/pywordnet-20/, created more than a decade ago), the main processes in Oliver Steele's Python port of the WordNet morphy are:

    1. Check the exception lists
    2. Apply rules once to the input to get y1, y2, y3, etc.
    3. Return all that are in the database (and check the original too)
    4. If there are no matches, keep applying rules until we find a match
    5. Return an empty list if we can't find anything
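
    Step 4 ("keep applying rules until we find a match") can be sketched as a breadth-first search over candidate forms (a hypothetical helper for illustration, not nltk's actual code):

```python
def morphy_sketch(word, database, substitutions):
    # Hypothetical sketch of morphy's iterative rule application
    # (the exception-list handling from step 1 is omitted for brevity).
    forms = [word]
    seen = {word}
    while forms:
        # Steps 2-3: return every candidate that is in the database
        # (the original word is checked on the first pass).
        matches = [f for f in forms if f in database]
        if matches:
            return matches
        # Step 4: apply the substitution rules again to all candidates.
        next_forms = []
        for f in forms:
            for old, new in substitutions:
                if f.endswith(old):
                    candidate = f[:len(f) - len(old)] + new
                    if candidate and candidate not in seen:
                        seen.add(candidate)
                        next_forms.append(candidate)
        forms = next_forms
    return []  # Step 5: nothing found

print(morphy_sketch("churches", {"church"}, [("es", ""), ("s", "")]))
# ['church']
```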

    For spaCy, it's possibly still under development, given the TODO at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76

    But the general process seems to be:

    1. Look up the exceptions; take the lemmas from the exception list if the word is in it.
    2. Apply the rules.
    3. Save the resulting forms that are in the index lists.
    4. If there are no lemmas from steps 1-3, fall back to the out-of-vocabulary (OOV) forms, and failing that, the original string.
    5. Return the lemma forms.

    In terms of OOV handling, spaCy returns the original string if no lemmatized form is found. In that respect, the nltk implementation of morphy does the same, e.g.

    >>> from nltk.stem import WordNetLemmatizer
    >>> wnl = WordNetLemmatizer()
    >>> wnl.lemmatize('alvations')
    'alvations'
    

    Checking for infinitive before lemmatization

    Possibly another point of difference is how morphy and spaCy decide what POS to assign to the word. In that respect, spaCy puts some linguistic rules in the Lemmatizer() to decide whether a word is the base form, and skips lemmatization entirely if the word is already in the infinitive form (is_base_form()). This can save quite a bit of time if lemmatization is done for all words in a corpus and quite a chunk of them are infinitives (already the lemma form).

    But that's possible in spaCy because it allows the lemmatizer to access the POS tag, which is tied closely to some morphological rules. For morphy, although it's possible to figure out some morphology using the fine-grained PTB POS tags, it still takes some effort to sort them out to know which forms are infinitives.
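
    As a sketch of the idea (with hypothetical feature names, not spaCy's actual is_base_form() implementation), such a check reads the morphological features tied to the POS tag and short-circuits lemmatization:

```python
def is_base_form_sketch(pos, morphology):
    # Hypothetical check: skip lemmatization when the morphology
    # already marks the word as its own base form.
    if pos == "VERB" and morphology.get("VerbForm") == "Inf":
        return True
    if pos == "NOUN" and morphology.get("Number") == "Sing":
        return True
    if pos == "ADJ" and morphology.get("Degree") == "Pos":
        return True
    return False

print(is_base_form_sketch("VERB", {"VerbForm": "Inf"}))  # True
print(is_base_form_sketch("ADJ", {"Degree": "Cmp"}))     # False ("wider")
```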

    Generally, the 3 primary signals of morphological features need to be teased out of the POS tag:

    • person
    • number
    • gender

    Updated

    spaCy made changes to its lemmatizer after the initial answer (12 May 17). I think the purpose was to make lemmatization faster by avoiding the index look-ups and rule processing.

    So they pre-lemmatized words and left them in a lookup hash table to make retrieval O(1) for words they have pre-lemmatized: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/lemmatizer/lookup.py
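
    The idea can be sketched with a plain dict (the entries below are illustrative assumptions, not the real table):

```python
# Toy excerpt of a pre-computed word -> lemma lookup table.
LOOKUP = {
    "wider": "wide",
    "widest": "wide",
    "was": "be",
}

def lookup_lemmatize(word):
    # O(1) retrieval; fall back to the original string for OOV words.
    return LOOKUP.get(word.lower(), word)

print(lookup_lemmatize("wider"))      # wide
print(lookup_lemmatize("alvations"))  # alvations
```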

    Also, in an effort to unify the lemmatizers across languages, the lemmatizer is now located at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L92

    But the underlying lemmatization steps discussed above are still relevant to the current spaCy version (4d2d7d586608ddc0bcb2857fb3c2d0d4c151ebfc).


    Epilogue

    I guess now that we know it works with linguistic rules and all, the other question is "are there any non-rule-based methods for lemmatization?"

    But before answering that question, "What exactly is a lemma?" might be the better question to ask.