I'm trying to lemmatize the tokenized column comments_tokenized.
My code:
import nltk
from nltk.stem import WordNetLemmatizer
# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]
df1['comments_lemmatized'] = df1["comments_tokenized"].apply(lemmatize_text)
but I get
TypeError: unhashable type: 'list'
What can I do to lemmatize a column whose cells are lists of tokens (a bag of words)?
Also, how can I avoid the tokenization problem that splits [don't] into [do, n't]?
You were close with your function! Since you are using apply
on the Series, you don't need to reference the column inside the function. You are also not using the input text
at all in your function. So change
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]
to
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in text]  # Notice the use of text
An example:
import pandas as pd
from nltk.stem import WordNetLemmatizer

df = pd.DataFrame({'A': [["cats", "cacti", "geese", "rocks"]]})
df
                             A
0  [cats, cacti, geese, rocks]

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in text]

df['A'].apply(lemmatize_text)
0    [cat, cactus, goose, rock]
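On the second part of the question: NLTK's word_tokenize splits "don't" into "do" and "n't" by design, following the Penn Treebank convention. If you want contractions kept whole, one option is NLTK's TweetTokenizer; another is a simple regex tokenizer. A minimal stdlib sketch (the pattern here is an assumption, tune it for your data):

```python
import re

# Match a word containing an internal apostrophe (don't, it's)
# before falling back to a plain alphanumeric word.
TOKEN_RE = re.compile(r"[A-Za-z]+'[A-Za-z]+|[A-Za-z0-9]+")

def tokenize_keep_contractions(text):
    return TOKEN_RE.findall(text)

tokenize_keep_contractions("I don't like rocks")
# → ['I', "don't", 'like', 'rocks']
```

You can then apply this tokenizer to the raw comments column before lemmatizing, instead of word_tokenize.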