Tags: python, nltk, text-analysis, stemming

Stemming in Python


I want to stem text that I am reading from a CSV file, but after the stemming step the text is unchanged. I then read somewhere that I need POS tags in order to stem, but adding them didn't help.

Can you please tell me what I am doing wrong? I am reading the CSV, removing punctuation, tokenizing, getting POS tags, and then trying to stem, but nothing changes.

import csv
import string
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
import nltk
from nltk import pos_tag

stemmer = nltk.PorterStemmer()
data = pd.read_csv(open('data.csv'),sep=';')

translator=str.maketrans('','',string.punctuation)

with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';',
                        quotechar='^', quoting=csv.QUOTE_MINIMAL)

    for line in data['sent']:
        line = line.translate(translator)
        tokens = word_tokenize(line)
        tokens_pos = nltk.pos_tag(tokens)
        final = [stemmer.stem(tagged_word[0]) for tagged_word in tokens_pos]
        writer.writerow(tokens_pos)

Examples of data for stemming:

The question was, what are you going to cut?
Well, again, while you were on the board of the Woods Foundation...
We've got some long-term challenges in this economy.

Thank you in advance for any help!


Solution

  • You should have tried to debug your code. If (after the necessary imports) you had just tried print(stemmer.stem("challenges")), you would have seen that the stemming does work (it prints "challeng"). Your problem is a small oversight: you collect the stems in final, but you write tokens_pos to the output file. So the "solution" is this:

    writer.writerow(final)
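
For completeness, here is a minimal sketch of the corrected loop. It rests on a few assumptions that are not in the question: the helper names stem_line, write_stems and default_stem are made up for illustration, tokens are produced with a plain whitespace split instead of word_tokenize (so no tokenizer data has to be downloaded), and the stemmer is passed in as a function. The POS-tagging step is left out, because PorterStemmer.stem() takes only the word itself; POS tags matter for lemmatizers such as WordNetLemmatizer, not for Porter stemming.

```python
import csv
import io
import string

# Translation table that deletes ASCII punctuation, as in the question.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def stem_line(line, stem):
    """Strip punctuation, split on whitespace, and stem each token.

    `stem` is any function mapping a word to its stem (hypothetical helper).
    """
    tokens = line.translate(PUNCT_TABLE).split()
    return [stem(token) for token in tokens]


def write_stems(lines, stem, out):
    """Write one row of stems per input line to the file-like object `out`."""
    writer = csv.writer(out, delimiter=';', quotechar='^',
                        quoting=csv.QUOTE_MINIMAL)
    for line in lines:
        final = stem_line(line, stem)
        writer.writerow(final)  # write the stems, not the tagged tokens


def default_stem():
    """Return NLTK's Porter stemmer function (requires nltk to be installed)."""
    from nltk.stem import PorterStemmer  # imported lazily on purpose
    return PorterStemmer().stem
```

Inside the with-block from the question, this could be called as write_stems(data['sent'], default_stem(), csvfile). Passing the stemmer as an argument also makes it easy to check the CSV-writing part separately, for example with an io.StringIO buffer and a stand-in such as str.lower.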