
How can I lemmatize a tokenized column of a dataframe in Python?


I am trying to lemmatize the column "tokenized" in a dataframe. One cell of the column "tokenized" looks as follows: " yeah simply zurich generic serving think media bland prepared curry kind paying well loves used parboiled oily place elaborate non tasteful stay underspiced institution vegetarian indian clueless away hiltl anyone served support veg long like normal strong worth insult not rice kitchen know wont food cuisine fantastic fan time term patrons ".

When I run my code, it returns something like ",,e,n,d,e,d,,,p,a,y,i", which is not what I want. How can I lemmatize full words?

This is my code:

from nltk.stem import WordNetLemmatizer

lmtzr = WordNetLemmatizer()

reviews_english['tokenized_lem'] = reviews_english['tokenized'].apply(
                    lambda lst: [lmtzr.lemmatize(word) for word in lst])
reviews_english

Solution

  • The problem is that your "tokenized" column isn't ready for the lemmatization step: it contains a plain string, not a list of tokens. In other words, instead of having

    " yeah simply zurich generic serving ..."
    

    your dataframe's tokenized cell should contain a list of tokens (produced by a tokenizer from your original sentence), as in

    ["yeah", "simply", "zurich", "generic", "serving", ...]
    

    If the cell doesn't hold a proper list of tokens, Python will iterate over the string character by character inside your apply/lambda list comprehension, which is exactly where the ",,e,n,d,e,d,,,p,a,y,i" output comes from. See the sketch below.
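
    A minimal sketch of the fix, assuming NLTK's WordNetLemmatizer is the lmtzr from the question; the tiny reviews_english dataframe below is a hypothetical stand-in for your data, and str.split() stands in for whatever tokenizer originally produced the text:

    import pandas as pd
    from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet') once

    lmtzr = WordNetLemmatizer()

    # Hypothetical stand-in for the reviews_english dataframe from the question.
    reviews_english = pd.DataFrame(
        {'tokenized': ["yeah simply zurich generic serving loves curries"]})

    # Split the whitespace-separated string into a real list of tokens first;
    # nltk.word_tokenize would also work here.
    reviews_english['tokenized'] = reviews_english['tokenized'].str.split()

    # Now the comprehension iterates over words, not characters.
    reviews_english['tokenized_lem'] = reviews_english['tokenized'].apply(
        lambda lst: [lmtzr.lemmatize(word) for word in lst])

    print(reviews_english['tokenized_lem'].iloc[0])

    If your strings were created by joining tokens with spaces, splitting on whitespace recovers the original token list; otherwise, rerun the tokenizer you used in the first place before applying the lemmatizer.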