How to stem a pandas DataFrame using nltk? The output should be a stemmed DataFrame

I'm trying to pre-process a dataset that contains text data, and I have created a pandas DataFrame from it. My question is: how can I apply stemming to the DataFrame and get a stemmed DataFrame as output?

Solution

Given a pandas DataFrame df, you can stem its contents by tokenizing the text in each cell and applying a stemming function to every token, across the whole df.

As an example, I used the Snowball stemmer from nltk.

from nltk.stem.snowball import SnowballStemmer
englishStemmer = SnowballStemmer("english")  # create the English Snowball stemmer

And this tokenizer:

from nltk.tokenize import WhitespaceTokenizer
w_tokenizer = WhitespaceTokenizer()  # tokenize() is an instance method, so create an instance first

Define your function:

def stemm_texts(text):
    return [englishStemmer.stem(w) for w in w_tokenizer.tokenize(str(text))]
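
To see what such a function returns for a single string before applying it to a DataFrame, here is a minimal self-contained sketch. The names stemmer, tokenizer, and stem_tokens are illustrative, not part of nltk, and the sample sentence is made up.

```python
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WhitespaceTokenizer

stemmer = SnowballStemmer("english")
tokenizer = WhitespaceTokenizer()  # instance needed: tokenize() is an instance method


def stem_tokens(text):
    """Split the text on whitespace and stem each token."""
    return [stemmer.stem(token) for token in tokenizer.tokenize(str(text))]


# Illustrative input; the stemmer also lowercases each token.
tokens = stem_tokens("Dogs running")
print(tokens)
```

The result is a list of tokens, not a string, so every stemmed cell in the DataFrame will hold a list until it is joined back together.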

Apply the function on your df:

df = df.apply(lambda y: y.map(stemm_texts, na_action='ignore'))

Note that na_action='ignore' leaves NaN cells untouched. Without it, str(text) would turn each NaN into the string 'nan' and stem that.

At this point each cell holds a list of stemmed tokens. If you want plain strings again, detokenize:

from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenizer = TreebankWordDetokenizer()
df = df.apply(lambda y: y.map(detokenizer.detokenize, na_action='ignore'))
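
Putting the steps together, the whole pipeline might look like the sketch below. It assumes a small made-up DataFrame with one missing value, and it stems and detokenizes in a single function (stem_text, a name chosen here for illustration) rather than in two passes over the df.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

stemmer = SnowballStemmer("english")
tokenizer = WhitespaceTokenizer()
detokenizer = TreebankWordDetokenizer()


def stem_text(text):
    """Tokenize on whitespace, stem each token, and join the stems into a string."""
    stems = [stemmer.stem(token) for token in tokenizer.tokenize(str(text))]
    return detokenizer.detokenize(stems)


# Hypothetical sample data; the None cell stands in for a missing value.
df = pd.DataFrame({
    "title": ["Running dogs", None],
    "body": ["Cats jumping", "Walked dogs"],
})

# na_action="ignore" skips missing cells instead of stemming their string form.
stemmed_df = df.apply(lambda column: column.map(stem_text, na_action="ignore"))
print(stemmed_df)
```

Doing both steps in one function avoids a second pass over the DataFrame. The two-pass version above is still useful when you need the intermediate token lists.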