Tags: python, text, nlp, nltk, lemmatization

Using WordNetLemmatizer.lemmatize() with pos_tags throws KeyError


I just read that lemmatization gives the best results when it is assisted by POS tags, so I wrote the code below. However, I get a KeyError for the calculated POS tags. Here is the code:

   from nltk import pos_tag
   x['Phrase']=x['Phrase'].transform(lambda value:value.lower())
   x['Phrase']=x['Phrase'].transform(lambda value:pos_tag(value))

Output after the 3rd line (after calculating the POS tags): [image not included]

   from nltk.stem import WordNetLemmatizer 
   lemmatizer = WordNetLemmatizer()
   x['Phrase_lemma']=x['Phrase'].transform(lambda value: ' '.join([lemmatizer.lemmatize(a[0],pos=a[1]) for a in  value]))

Error:

 KeyError                                  Traceback (most recent call last)
  <ipython-input-8-c2400a79a016> in <module>
  1 from nltk.stem import WordNetLemmatizer
  2 lemmatizer = WordNetLemmatizer()
  ----> 3 x['Phrase_lemma']=x['Phrase'].transform(lambda value: ' '.join([lemmatizer.lemmatize(a[0],pos=a[1]) for a in  value]))

 KeyError: 'DT'

Solution

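  • You get a KeyError because WordNet does not use the same POS labels as pos_tag. pos_tag returns Penn Treebank tags such as 'DT', while, based on the source code, the POS labels accepted by WordNet are adjective ('a'), adjective satellite ('s'), adverb ('r'), noun ('n') and verb ('v').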

    EDIT based on @bivouac0 's comment:

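    To work around this, you need a mapper from Treebank tags to WordNet POS labels. The mapping function is heavily based on another answer. Words whose POS is not supported are left as they are and not lemmatized. The example also tokenizes each phrase with word_tokenize first, because pos_tag expects a list of tokens rather than a raw string.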

    import nltk
    import pandas as pd
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer 
    
    lemmatizer = WordNetLemmatizer()
    
    def get_wordnet_pos(treebank_tag):
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return None
    
    x = pd.DataFrame(data=[['this is a sample of text.'], ['one more text.']], 
                     columns=['Phrase'])
    
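    # Tokenize each phrase, then tag the resulting list of tokens.
    x['Phrase'] = x['Phrase'].apply(lambda v: nltk.pos_tag(nltk.word_tokenize(v)))
    
    # Lemmatize tokens whose tag maps to a WordNet POS; keep the others unchanged.
    x['Phrase_lemma'] = x['Phrase'].transform(
        lambda value: ' '.join(
            lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag)) if get_wordnet_pos(tag) else word
            for word, tag in value
        )
    )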
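
    As a variation on the mapper above, here is a minimal sketch that looks the tag up in a dict and computes the WordNet POS only once per token. The names TREEBANK_TO_WORDNET, to_wordnet_pos and lemmatize_tagged are hypothetical helpers made up for illustration. The single-letter values are assumed to match wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV, and the lemmatize argument is assumed to be a callable such as WordNetLemmatizer().lemmatize.

```python
# Hypothetical helper: Penn Treebank tag prefix -> WordNet POS letter.
# The letters are assumed to equal wordnet.ADJ, wordnet.VERB,
# wordnet.NOUN and wordnet.ADV from nltk.corpus.
TREEBANK_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def to_wordnet_pos(treebank_tag):
    """Return the WordNet POS letter for a Treebank tag, or None if unsupported."""
    return TREEBANK_TO_WORDNET.get(treebank_tag[:1])


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Join the lemmas of (word, treebank_tag) pairs into one string.

    `lemmatize` is any callable accepting lemmatize(word, pos=...),
    for example WordNetLemmatizer().lemmatize. Tokens whose tag has no
    WordNet equivalent (such as 'DT') are passed through unchanged.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        pos = to_wordnet_pos(tag)  # looked up once per token
        lemmas.append(lemmatize(word, pos=pos) if pos else word)
    return " ".join(lemmas)
```

    With the tagged column from the answer, this could be applied as x['Phrase'].apply(lambda tagged: lemmatize_tagged(tagged, lemmatizer.lemmatize)). Passing the lemmatizer in as an argument keeps the helper usable without loading WordNet, which also makes it easy to try out with a stub function.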