python nltk porter-stemmer

PorterStemmer() trims the last word in a sentence differently


I have the following code, which runs in an offline environment:

import pandas as pd
import re
from nltk.stem import PorterStemmer

test = {'grams':  ['First value because one does two THREE', 'Second value because three and three four', 'Third donkey three']}
test = pd.DataFrame(test, columns = ['grams'])
STOPWORDS = {'and', 'does', 'because'}

def rower(x):
    cleanQ = []  
    for i in range(len(x)): cleanQ.append(re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', x[i]).lower())
    
    splitQ = []
    for row in cleanQ: splitQ.append(row.split())
    splitQ[:] = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    splitQ = list(map(' '.join, splitQ))
    print(splitQ)
    
    originQ = []    
    for i in splitQ: 
        originQ.append(PorterStemmer().stem(i))
    print(originQ)
    
rower(test.grams)

Which produces this:

['first value one two three', 'second value three three four', 'third donkey three']
['first value one two thre', 'second value three three four', 'third donkey thre']

The first list shows the sentences before stemming; the second shows them after PorterStemmer() has been applied.

As you can see, PorterStemmer() trims the word "three" to "thre" only when it is the last word in a sentence. When "three" appears anywhere else, it stays "three". I can't figure out why. I also worry that if I apply rower(x) to other sentences, it will silently produce similar results without my noticing.

How do I prevent PorterStemmer from treating the last word differently?


Solution

  • The main mistake is that you are passing a whole sentence to the stemmer instead of one word at a time. stem() expects a single word, so the entire string (third donkey three) is treated as one long word and only its ending is stemmed. That is why only the last word of each sentence changes.

    import pandas as pd
    import re
    from nltk.stem import PorterStemmer
    
    test = {'grams': ['First value because one does two THREE', 'Second value because three and three four',
                      'Third donkey three']}
    test = pd.DataFrame(test, columns=['grams'])
    STOPWORDS = {'and', 'does', 'because'}
    
    ps = PorterStemmer()
    
    def rower(x):
        # Replace punctuation and whitespace with spaces, then lowercase each sentence.
        cleanQ = [re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', s).lower() for s in x]
    
        # Split on whitespace and drop stop words, keeping one token list per sentence.
        splitQ = [[word for word in row.split() if word not in STOPWORDS] for row in cleanQ]
        print('IN:', splitQ)
        # Stem one word at a time; the stemmer expects a single word, not a sentence.
        originQ = [[ps.stem(word) for word in sent] for sent in splitQ]
        print('OUT:', originQ)
    
    
    rower(test.grams)
    

    Output:

    IN: [['first', 'value', 'one', 'two', 'three'], ['second', 'value', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
    OUT: [['first', 'valu', 'one', 'two', 'three'], ['second', 'valu', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
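
If you want sentences back as strings (as in the question's original output) rather than lists of tokens, you can stem word by word and rejoin. The following is a minimal sketch, not part of the original answer; stem_sentence is a hypothetical helper name, and it assumes the sentences are already cleaned and lowercased so that a plain whitespace split is enough. It also contrasts the two ways of calling the stemmer:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()


def stem_sentence(sentence):
    """Hypothetical helper: stem each whitespace-separated word, then rejoin."""
    return ' '.join(ps.stem(word) for word in sentence.split())


sentences = ['first value one two three', 'third donkey three']

# Whole sentence passed in: the stemmer sees one long "word",
# so only the very end of the string can change.
whole = [ps.stem(s) for s in sentences]

# One word at a time: every word is stemmed by the same rules,
# regardless of its position in the sentence.
per_word = [stem_sentence(s) for s in sentences]

print(whole)
print(per_word)
```

With the NLTK version used in this thread, the first list should reproduce the "thre" endings from the question, and the second should match the per-word results shown in the OUT line above.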
    

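    Note that stemming one word at a time still drops the trailing 'e' from some words (value becomes valu above). There are good explanations of why the stemmer does this in the related question below. Consider using a lemmatizer if the output doesn't meet your expectations.

    Related question: How to stop NLTK stemmer from removing the trailing “e”?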