Tags: python-3.x, pandas, nltk, stemming, lemmatization

Lemmatize df column


I am trying to lemmatize the text in a DataFrame column, but the function I wrote isn't working. Before I tried to lemmatize it, the data in the column looked like this:

[image: the content column before lemmatization]

Then I ran the following code:

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer 

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in text]  

df['content'] = df["content"].apply(lemmatize_text)
print(df.content)

Now the content column looks like this:

[image: the content column after running the code above]

I'm not sure what I did wrong; I am just trying to lemmatize the data in the content column. Any help would be greatly appreciated.


Solution

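  • You are lemmatizing each character instead of each word: iterating over a string yields its characters, so the text has to be split into words first. Your function should look like this instead:

    def lemmatize_text(text):
        # Split the string into words; looping over the string itself yields characters.
        lemmatizer = WordNetLemmatizer()
        # Lemmatize each word, then join the words back into a single string.
        return ' '.join([lemmatizer.lemmatize(w) for w in text.split(' ')])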
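To see why the original function returned single characters, here is a minimal sketch that needs neither NLTK nor pandas. `toy_lemmatize` is a hypothetical stand-in for `WordNetLemmatizer().lemmatize` (it only strips a trailing "s"), used so the example runs without downloading WordNet. The function names and the sample string are assumptions for illustration, not part of the original post.

```python
def toy_lemmatize(word):
    # Hypothetical stand-in for WordNetLemmatizer().lemmatize:
    # strips one trailing "s" from words longer than three characters.
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def lemmatize_chars(text):
    # Approach from the question: iterating over a string yields single characters.
    return [toy_lemmatize(w) for w in text]


def lemmatize_words(text):
    # Approach from the answer: split into words first, then join back into one string.
    return " ".join(toy_lemmatize(w) for w in text.split())


sample = "cats chase dogs"
chars_result = lemmatize_chars(sample)
words_result = lemmatize_words(sample)

print(chars_result[:4])  # ['c', 'a', 't', 's']
print(words_result)      # cat chase dog
```

This sketch uses `text.split()` rather than `text.split(' ')`: with no argument, `split` treats any run of whitespace as one separator, while `split(' ')` produces empty strings when two spaces appear in a row. Either works for single-spaced text. Note also that NLTK's `lemmatize` treats words as nouns by default, so verb forms may come back unchanged unless a `pos` argument is passed.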