
How to stem a pandas DataFrame using NLTK? The output should be a stemmed DataFrame


I'm trying to preprocess a dataset that contains text data, and I have created a pandas DataFrame from it. My question is: how can I apply stemming to the DataFrame and get a stemmed DataFrame as output?


Solution

  • Given a pandas DataFrame df, you can stem its contents by tokenizing the text in each cell and applying a stemming function to every token, across the whole df.

    For this example, I used the Snowball stemmer from nltk.

    from nltk.stem.snowball import SnowballStemmer
    englishStemmer = SnowballStemmer("english")  # create an English Snowball stemmer
    

    And this tokenizer:

    from nltk.tokenize import WhitespaceTokenizer
    w_tokenizer = WhitespaceTokenizer()  # tokenize() is an instance method, so create an instance
    

    Define your function:

    def stemm_texts(text):
        return [englishStemmer.stem(w) for w in w_tokenizer.tokenize(str(text))]
    

    Apply the function to every cell of your df (each text cell becomes a list of stemmed tokens):

    df = df.apply(lambda y: y.map(stemm_texts, na_action='ignore'))
    

    Note that na_action='ignore' leaves NaN values untouched, so they are not passed to the function and turned into the string 'nan' by str(text).

    To turn the token lists back into strings, you might want to detokenize again:

    from nltk.tokenize.treebank import TreebankWordDetokenizer
    
    detokenizer = TreebankWordDetokenizer()
    df = df.apply(lambda y: y.map(detokenizer.detokenize, na_action='ignore'))
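
Putting the steps together, here is a minimal end-to-end sketch on a small made-up DataFrame. The column names, the sample sentences, and the helper name `stem_cell` are hypothetical. For brevity, this variant rejoins the stems with a plain space instead of using TreebankWordDetokenizer, so treat it as one possible way to wire things up rather than the only one.

```python
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WhitespaceTokenizer

stemmer = SnowballStemmer("english")
tokenizer = WhitespaceTokenizer()


def stem_cell(text):
    """Stem every whitespace-separated token and rejoin the stems with spaces."""
    return " ".join(stemmer.stem(tok) for tok in tokenizer.tokenize(str(text)))


# Hypothetical sample data; None stands in for a missing value.
df = pd.DataFrame({
    "title": ["Running dogs", "Jumping cats"],
    "body": ["The cats are running", None],
})

# Map the helper over each column; na_action="ignore" skips missing values.
stemmed = df.apply(lambda col: col.map(stem_cell, na_action="ignore"))
print(stemmed)
```

The Snowball stemmer lowercases its input, so the stemmed cells come back in lowercase, and the missing value in the "body" column stays missing.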