Tags: python, python-3.x, nlp, nltk, lemmatization

Python lemmatizer that lemmatizes "political" and "politics" to the same word


I've been testing different Python lemmatizers for a solution I'm building. One difficult problem I've faced is that stemmers produce non-English words, which won't work for my use case. Although stemmers correctly map "politics" and "political" to the same stem, I'd like to do this with a lemmatizer, but spaCy and NLTK produce different words for "political" and "politics". Does anyone know of a more powerful lemmatizer? My ideal solution would look like this:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("political = ", lemmatizer.lemmatize("political"))
print("politics = ", lemmatizer.lemmatize("politics"))  

returning:

political =  political
politics =  politics  

Where I want to return:

political =  politics
politics =  politics  

Solution

  • Firstly, a lemma is not a "root" word as you thought it to be. It's just a form that exists in the dictionary. For English, the dictionary behind NLTK's WordNetLemmatizer is WordNet, and as long as a form has a dictionary entry in WordNet, it is a lemma. There are entries for both "political" and "politics", so they're both valid lemmas:

    from itertools import chain
    from nltk.corpus import wordnet as wn

    print(set(chain(*[ss.lemma_names() for ss in wn.synsets('political')])))
    print(set(chain(*[ss.lemma_names() for ss in wn.synsets('politics')])))
    

    [out]:

    {'political'}
    {'political_sympathies', 'political_relation', 'government', 'politics', 'political_science'}
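
    As a quick check with the WordNetLemmatizer from the question: even with an explicit POS tag, both words come back unchanged, because each one is already a lemma in WordNet:

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    # both forms are already WordNet lemmas, so neither gets reduced further
    print(lemmatizer.lemmatize("political", pos="a"))  # political
    print(lemmatizer.lemmatize("politics", pos="n"))   # politics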
    

    Maybe there are other tools out there that can do that, but I'll try this as a first approach.

    First, stem all lemma names and group the lemmas with the same stem:

    from collections import defaultdict
    
    from wn import WordNet
    from nltk.stem import PorterStemmer
    
    porter = PorterStemmer()
    wn = WordNet()
    
    # map each Porter stem to the set of lemma names that share it
    x = defaultdict(set)
    i = 0  # count of lemma names processed
    for lemma_name in wn.all_lemma_names():
        if lemma_name:
            x[porter.stem(lemma_name)].add(lemma_name)
            i += 1
    

    Note: pip install -U wn

    Then, as a sanity check, we confirm that the no. of lemma names is greater than the no. of stem groups:

    print(len(x.keys()), i)
    

    [out]:

    128442 147306
    

    Then we can take a look at the groupings:

    for k in sorted(x):
        if len(x[k]) > 1:
            print(k, x[k])
    

    It seems to do what we need: grouping words together with their "root word", e.g.

    poke {'poke', 'poking'}
    polar {'polarize', 'polarity', 'polarization', 'polar'}
    polaris {'polarisation', 'polarise'}
    pole_jump {'pole_jumping', 'pole_jumper', 'pole_jump'}
    pole_vault {'pole_vaulter', 'pole_vault', 'pole_vaulting'}
    poleax {'poleaxe', 'poleax'}
    polem {'polemically', 'polemics', 'polemic', 'polemical', 'polemize'}
    police_st {'police_state', 'police_station'}
    polish {'polished', 'polisher', 'polish', 'polishing'}
    polit {'politics', 'politic', 'politeness', 'polite', 'politically', 'politely', 'political'}
    poll {'poll', 'polls'}
    

    But if we look closer, there is some confusion:

    polit {'politics', 'politic', 'politeness', 'polite', 'politically', 'politely', 'political'}
    

    So I would suggest the next step is

    to loop through the groupings again, run some semantic check on the "relatedness" of the words, and split off the words that might not be related; maybe try something like the Universal Sentence Encoder, e.g. https://colab.research.google.com/drive/1BM-eKdFb2G2zXqNt3dHgVm4gH8PaPJOq (might not be a trivial task; a rough sketch follows these two options),

    or do some manual work and reorder the groupings. (The heavy lifting is already done by the Porter stemmer in the grouping; now it's time to do some human work.)
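    As a rough illustration of that first option, here's a minimal sketch that uses WordNet path similarity as a cheap stand-in for a heavier model like the Universal Sentence Encoder; the 0.2 threshold and the "first 3 senses" cut-off are arbitrary choices for illustration:

    from itertools import combinations
    from nltk.corpus import wordnet as wordnet_corpus
    
    def flag_unrelated(group, threshold=0.2):
        """Return word pairs in a stem group whose best path similarity is low."""
        suspicious = []
        for a, b in combinations(sorted(group), 2):
            syns_a, syns_b = wordnet_corpus.synsets(a), wordnet_corpus.synsets(b)
            if not syns_a or not syns_b:
                continue
            # best path similarity over the first few senses of each word;
            # None (no path between senses) is treated as 0
            best = max(
                (sa.path_similarity(sb) or 0)
                for sa in syns_a[:3]
                for sb in syns_b[:3]
            )
            if best < threshold:
                suspicious.append((a, b, best))
        return suspicious
    
    print(flag_unrelated(x['polit']))

    Pairs flagged here are candidates for splitting out of the group before the manual pass.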

    Then you'll have to somehow find the root within each group of words (i.e. a prototype/label for the cluster).

    Finally, using the resource of groups of words you've created, you can now "find the root word".
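
    For example, a minimal sketch of that last step, building a word -> "root word" lookup from the groupings x above. The heuristic of picking the lemma with the most WordNet senses as each group's prototype is just one arbitrary choice; you could equally pick the shortest word or hand-label the groups:

    from nltk.corpus import wordnet as wordnet_corpus
    
    def pick_prototype(group):
        # heuristic: prefer the lemma with the most WordNet senses,
        # breaking ties in favour of the shorter word
        return max(group, key=lambda w: (len(wordnet_corpus.synsets(w)), -len(w)))
    
    # word -> prototype ("root word") lookup built from the stem groupings
    root_of = {}
    for stem, group in x.items():
        prototype = pick_prototype(group)
        for word in group:
            root_of[word] = prototype
    
    print("political = ", root_of.get("political", "political"))
    print("politics = ", root_of.get("politics", "politics"))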