
How to lemmatize a text column in a pandas DataFrame using Stanza?


I read a CSV file into a pandas DataFrame.

My text column is df['story'].

How do I lemmatize this column?

Should I tokenize it first?


Solution

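  • No, you don't need to tokenize separately before lemmatizing. The Stanza pipeline below includes its own tokenize processor, so you can pass the raw text straight in. You can try the following code: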

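    import stanza
    import pandas as pd
    
    # Download the English models if they are not already cached locally.
    stanza.download('en')
    
    # The pipeline tokenizes the raw text itself before tagging and lemmatizing.
    nlp = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma')
    
    def lemmatize_text(text):
        # Leave missing values (e.g. NaN from empty CSV cells) untouched.
        if not isinstance(text, str):
            return text
        doc = nlp(text)
        # Collect the lemma of every word in every sentence.
        lemmas = [word.lemma for sent in doc.sentences for word in sent.words]
        return ' '.join(lemmas)
    
    # df is the DataFrame already loaded from the CSV file.
    df['lemmatized_story'] = df['story'].apply(lemmatize_text)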
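
Running the pipeline on every row can be slow when the column contains repeated or empty stories. The following is a minimal sketch of one way to reduce that work, not part of the original answer: a hypothetical helper, `lemmatize_values`, that calls a lemmatizing function once per distinct string and skips anything that is not usable text. The helper name and the choice to return `None` for missing or blank entries are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional


def lemmatize_values(values: Iterable[object],
                     lemmatize: Callable[[str], str]) -> List[Optional[str]]:
    """Apply `lemmatize` once per distinct string and reuse the result.

    Non-string entries (e.g. NaN from empty CSV cells) and blank strings
    are mapped to None instead of being passed to `lemmatize`.
    """
    cache: Dict[str, str] = {}          # text -> lemmatized text
    results: List[Optional[str]] = []
    for value in values:
        # Skip anything that is not usable text.
        if not isinstance(value, str) or not value.strip():
            results.append(None)
            continue
        # Only call the (potentially slow) lemmatizer for unseen texts.
        if value not in cache:
            cache[value] = lemmatize(value)
        results.append(cache[value])
    return results
```

With the function from the answer, this could be used as `df['lemmatized_story'] = lemmatize_values(df['story'], lemmatize_text)`. Whether the cache helps depends on how many duplicate rows the column actually has.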