I want to stem text that I am reading from a CSV file, but after stemming the text is unchanged. I then read somewhere that I need POS tags in order to stem, but that didn't help.
Can you please tell me what I am doing wrong? I am reading the CSV, removing punctuation, tokenizing, getting POS tags, and trying to stem, but nothing changes.
import csv
import string

import nltk
import pandas as pd
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

stemmer = PorterStemmer()
data = pd.read_csv(open('data.csv'), sep=';')
translator = str.maketrans('', '', string.punctuation)

with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';',
                        quotechar='^', quoting=csv.QUOTE_MINIMAL)
    for line in data['sent']:
        line = line.translate(translator)
        tokens = word_tokenize(line)
        tokens_pos = nltk.pos_tag(tokens)
        final = [stemmer.stem(tagged_word[0]) for tagged_word in tokens_pos]
        writer.writerow(tokens_pos)
Examples of data for stemming:
The question was, what are you going to cut?
Well, again, while you were on the board of the Woods Foundation...
We've got some long-term challenges in this economy.
Thank you in advance for any help!
You should have tried to debug your code. If (after the necessary imports) you had just tried print(stemmer.stem("challenges")), you would have seen that the stemming does work (the above will print "challeng"). Your problem is a small oversight: you collect the stems in final, but you write tokens_pos to the output file. So the "solution" is this:
writer.writerow(final)
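To check the fix without touching any files, here is a minimal sketch under a few assumptions that are not in the question: it writes to an in-memory io.StringIO buffer instead of output.csv, it tokenizes with str.split() instead of word_tokenize so that no NLTK data download is needed, and stem_line is a hypothetical helper name. It also skips POS tagging, since the Porter stemmer only takes the word itself.

```python
import csv
import io
import string

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
translator = str.maketrans('', '', string.punctuation)


def stem_line(line):
    """Strip punctuation, split on whitespace, and stem each token.

    Hypothetical helper; str.split() is a simplification standing in
    for word_tokenize from the question.
    """
    tokens = line.translate(translator).split()
    return [stemmer.stem(token) for token in tokens]


# Two of the sample sentences from the question.
sentences = [
    "The question was, what are you going to cut?",
    "We've got some long-term challenges in this economy.",
]

buffer = io.StringIO()  # assumption: stands in for output.csv
writer = csv.writer(buffer, delimiter=';',
                    quotechar='^', quoting=csv.QUOTE_MINIMAL)
for sentence in sentences:
    # Write the stems, not the (word, tag) tuples.
    writer.writerow(stem_line(sentence))

print(buffer.getvalue())
```

Each output row should now contain stems such as "challeng" rather than the original words paired with their tags.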