python nlp porter-stemmer

Maintain proper nouns and capitalised words while stemming


I am designing a text-processing program and need to stem the words for later exploratory analysis. I have to use the Porter stemmer.

I have designed a DataFrame structure to store my data and a function to apply to it. When I apply the function to the DataFrame, the stemming works, but capitalised words (such as proper nouns) do not keep their capitalisation.

A snippet of my code:

from nltk.stem.porter import PorterStemmer

def stemming(word):
    stemmer = PorterStemmer()
    word = str(word)
    if word.title():
        stemmer.stem(word).capitalize()
    elif word.isupper():
        stemmer.stem(word).upper()
    else:
        stemmer.stem(word)
    return word

dfBody['body'] = dfBody['body'].apply(lambda x: [stemming(y) for y in x])

This is my result, which has no capitalised words.

Sample of dataset (my dataset is very large):

file    body
PP3169 ['performing', 'Maker', 'USA', 'computer', 'Conference', 'NIPS']

Expected output (after applying stemming function):

file    body
PP3169 ['perform', 'Make', 'USA', 'comput', 'Confer', 'NIPS']

Any advice will be greatly appreciated!


Solution

  • First: you should assign the result back to word, because stemmer.stem(word) returns a new string and does not change word in place

    word = stemmer.stem(word).capitalize()
    

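    Second: word.title() does not check whether word is capitalised; it returns a title-cased copy, which is truthy for any non-empty word, so the first branch always runs. Compare the word with its title-cased form instead: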

    if word == word.title():
    

    or, alternatively,

    if word[0].isupper() and word[1:].islower():
    

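    Putting both fixes together:

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()  # create once and reuse for every word

    def stemming(word):
        word = str(word)
        if word == word.title():
            # Capitalised word, e.g. 'Conference': stem, then restore the capital
            word = stemmer.stem(word).capitalize()
        elif word.isupper():
            # All-caps word, e.g. 'USA': stem, then restore upper case
            word = stemmer.stem(word).upper()
        else:
            word = stemmer.stem(word)
        return word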
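
The two checks suggested above have edge cases worth knowing about: word == word.title() rejects words like "Don't" (title() capitalises the letter after the apostrophe), and word[0].isupper() and word[1:].islower() rejects one-letter words and raises IndexError on an empty string. The following is only a sketch of one way around that, not part of the original answer. The names match_case, stem_keep_case and toy_stem are hypothetical, and toy_stem is a stand-in so the example runs without NLTK; it is not the Porter algorithm.

```python
from typing import Callable


def match_case(original: str, stemmed: str) -> str:
    """Return `stemmed` with the casing pattern of `original`."""
    if original.isupper():
        # All-caps token (acronym or a single capital letter).
        return stemmed.upper()
    if original[:1].isupper() and original[1:].islower():
        # Capitalised token: first letter upper, the rest lower.
        # Slicing with [:1] avoids an IndexError on an empty string.
        return stemmed.capitalize()
    # Lower-case or mixed-case token: leave the stem unchanged.
    return stemmed


def stem_keep_case(word: object, stem: Callable[[str], str]) -> str:
    """Stem the lower-cased word, then restore the original casing."""
    text = str(word)
    return match_case(text, stem(text.lower()))


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a trailing 'ing'."""
    return word[:-3] if word.endswith("ing") else word


if __name__ == "__main__":
    # A non-empty string is always truthy, so `if word.title():` never fails.
    print(bool("performing".title()))  # True
    # str.title() also capitalises the letter after an apostrophe.
    print("Don't".title())  # Don'T
    for token in ["performing", "Meeting", "USA", "A", "Don't"]:
        print(token, "->", stem_keep_case(token, toy_stem))
```

To use this with the setup from the question (assuming NLTK and pandas are installed), pass the real stemmer's method as the stem argument, for example stemmer = PorterStemmer() followed by dfBody['body'].apply(lambda words: [stem_keep_case(w, stemmer.stem) for w in words]). Keeping the casing logic separate from the stemmer means it can be tested without NLTK and reused with a different stemmer.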