Tags: pandas, nltk, lemmatization

Lemmatize tokenised column in pandas


I'm trying to lemmatize the tokenized column comments_tokenized.


I do:

import nltk
from nltk.stem import WordNetLemmatizer 

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]

df1['comments_lemmatized'] = df1["comments_tokenized"].apply(lemmatize_text)

but I get:

TypeError: unhashable type: 'list'

How can I lemmatize a column whose cells are lists of tokens?

Also, how can I avoid the tokenization problem that splits [don't] into [do, n't]?


Solution

  • You were close with your function! Since you are using apply on the Series, you don't need to reference the column inside the function, and your function never uses its text argument. As written, the list comprehension iterates over the whole comments_tokenized column, so each w is itself a list of tokens, and calling lemmatize on a list is what raises TypeError: unhashable type: 'list'. So change

    def lemmatize_text(text):
        return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]
    

    to

    def lemmatize_text(text):
        lemmatizer = WordNetLemmatizer()
        return [lemmatizer.lemmatize(w) for w in text]  # note: iterate over text, the function's argument
    

    An example:

    import pandas as pd

    df = pd.DataFrame({'A': [["cats", "cacti", "geese", "rocks"]]})
                                 A
    0  [cats, cacti, geese, rocks]
    
    def lemmatize_text(text):
        lemmatizer = WordNetLemmatizer()
        return [lemmatizer.lemmatize(w) for w in text]
    
    df['A'].apply(lemmatize_text)
    
    0    [cat, cactus, goose, rock]
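
    Putting it together, here is a minimal runnable sketch of the same pattern. A toy dictionary lookup stands in for WordNetLemmatizer so it runs without downloading the WordNet data; for real use, swap the lookup for lemmatizer.lemmatize. The df1 contents below are invented for illustration.

    ```python
    import pandas as pd

    # Toy stand-in for WordNetLemmatizer (avoids the WordNet download);
    # replace with lemmatizer.lemmatize(w) in practice.
    TOY_LEMMAS = {"cats": "cat", "cacti": "cactus", "geese": "goose", "rocks": "rock"}

    def lemmatize_text(tokens):
        # apply() hands this function ONE cell at a time, i.e. one list
        # of tokens -- so we iterate over the argument, not the column.
        return [TOY_LEMMAS.get(w, w) for w in tokens]

    df1 = pd.DataFrame({"comments_tokenized": [["cats", "cacti"], ["geese", "rocks"]]})
    df1["comments_lemmatized"] = df1["comments_tokenized"].apply(lemmatize_text)
    print(df1["comments_lemmatized"].tolist())  # [['cat', 'cactus'], ['goose', 'rock']]
    ```

    The key point is that apply passes each cell value to the function, so inside the function you never touch df1 itself.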