Tags: pandas, nltk, lemmatization

Lemmatize tokenised column in pandas


I'm trying to lemmatize the tokenized column comments_tokenized.


I do:

import nltk
from nltk.stem import WordNetLemmatizer 

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]

df1['comments_lemmatized'] = df1["comments_tokenized"].apply(lemmatize_text)

but I get:

TypeError: unhashable type: 'list'

How can I lemmatize a column whose cells are lists of tokens?

Also, how can I avoid the tokenization problem that splits [don't] into [do, n't]?


Solution

  • You were close with your function! Since you are using apply on the Series, you don't need to reference the column inside the function, and your function never uses its text argument. As written, the list comprehension iterates over the whole comments_tokenized column, so each w is itself a list of tokens, and calling lemmatize on a list is what raises TypeError: unhashable type: 'list'. So change

    def lemmatize_text(text):
        return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]
    

    to

    def lemmatize_text(text):
        lemmatizer = WordNetLemmatizer()
        return [lemmatizer.lemmatize(w) for w in text]  # note: iterate over text, the function's argument
    

    An example:

    import pandas as pd

    df = pd.DataFrame({'A': [["cats", "cacti", "geese", "rocks"]]})
                                 A
    0  [cats, cacti, geese, rocks]
    
    def lemmatize_text(text):
        lemmatizer = WordNetLemmatizer()
        return [lemmatizer.lemmatize(w) for w in text]
    
    df['A'].apply(lemmatize_text)
    
    0    [cat, cactus, goose, rock]
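
    Putting it together, here is a minimal runnable sketch of the same pattern. A toy dictionary lookup stands in for WordNetLemmatizer so it runs without downloading the WordNet data; for real use, swap the lookup for lemmatizer.lemmatize. The df1 contents below are invented for illustration.

    ```python
    import pandas as pd

    # Toy stand-in for WordNetLemmatizer (avoids the WordNet download);
    # replace with lemmatizer.lemmatize(w) in practice.
    TOY_LEMMAS = {"cats": "cat", "cacti": "cactus", "geese": "goose", "rocks": "rock"}

    def lemmatize_text(tokens):
        # apply() hands this function ONE cell at a time, i.e. one list
        # of tokens -- so we iterate over the argument, not the column.
        return [TOY_LEMMAS.get(w, w) for w in tokens]

    df1 = pd.DataFrame({"comments_tokenized": [["cats", "cacti"], ["geese", "rocks"]]})
    df1["comments_lemmatized"] = df1["comments_tokenized"].apply(lemmatize_text)
    print(df1["comments_lemmatized"].tolist())  # [['cat', 'cactus'], ['goose', 'rock']]
    ```

    The key point is that apply passes each cell value to the function, so inside the function you never touch df1 itself.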