Search code examples
pythonnlpnltktaggingstemming

NLP: How do I combine stemming and tagging?


I'm trying to write code which passes in text that has been tokenized and had the stop words filtered out, and then stems and tags it. However, I'm not sure in what order I should stem and tag. This is what I have at the moment:

#### Stemming
ps = PorterStemmer()    # PorterStemmer imported from nltk.stem

stemText = []

for word in swFiltText:    # Tagged text w/o stop words
    stemText.append(ps.stem(word))


#### POS Tagging
def tagging():
    tagTot = []
    try:
        for i in stemText:
            words = nltk.word_tokenize(i)    # I need to tokenize again (idk why?)
            tagged = nltk.pos_tag(words)
            tagTot = tagTot + tagged    # Combine tagged words into list

    except Exception as e:
        print(str(e))
    return tagTot

tagText = tagging()

At first glance, this works just fine. However, because I stemmed first, pos_tag often mislabels words. For example, it marked "hous" as an adjective, when the original word was really the noun "house". But when I try to stem after tagging, it gives me an error about how pos_tag can't deal with 'tuples' - I'm guessing this has something to do with the way that the stemmer formats the word list as [('come', 'VB'), ('hous', 'JJ'), etc.

Should I be using a different stemmer/tagger? Or is the error in my code?

Thanks in advance!


Solution

  • I would suggest using lemmatization over stemming, stemming just chops off letters from the end until the root/stem word is reached. Lemmatization also looks at the surrounding text to determine the given words's part of speech.