I have a text file that I am trying to stem after removing stopwords, but nothing seems to change when I run it. My file is called data0.
Here is my code:
## Removing stopwords and tokenizing by words (split each word)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
data0 = word_tokenize(data0)
data0 = ' '.join([word for word in data0 if word not in (stopwords.words('english'))])
print(data0)
## Stemming the data
from nltk.stem import PorterStemmer
ps = PorterStemmer()
data0 = ps.stem(data0)
print(data0)
And I get the following (wrapped for legibility):
For us around Aberdeen , question `` What oil industry ? ( Evening Express , October 26 ) touch deja vu . That question asked almost since day first drop oil pumped North Sea . In past 30 years seen constant cycle ups downs , booms busts industry . I predict happen next . There period worry uncertainty scrabble find something keep local economy buoyant oil gone . Then upturn see jobs investment oil , everyone breathe sigh relief quest diversify go back burner . That downfall . Major industries prone collapse . Look nation 's defunct shipyards extinct coal steel industries . That 's vital n't panic downturns , start planning sensibly future . Our civic business leaders need constantly looking something secure prosperity - tourism , technology , bio-science emerging industries . We need economically strong rather waiting see happens oil roller coaster hits buffers . N JonesEllon
The first part of the code (removing stopwords and tokenizing) works fine, but the second part (stemming) does not work. Any idea why?
From the Stemmer docs http://www.nltk.org/howto/stem.html, it looks like the Stemmer is designed to be called on a single word at a time.
Try running it on each word in
[word for word in data0 if word not in (stopwords.words('english'))]
I.e. before calling join
E.g.
stemmed_list = []
for word in [w for w in data0 if w not in stopwords.words('english')]:
    stemmed_list.append(ps.stem(word))  # stem one token at a time
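To make the difference concrete, here is a minimal sketch contrasting the two calls. The short `tokens` list is only a stand-in I made up for the output of `word_tokenize` plus the stopword filter, so the snippet runs without downloading any NLTK data.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Stand-in for the filtered token list (hypothetical sample, not the real data0).
tokens = ["major", "industries", "prone", "collapse"]

# Passing the joined string treats the whole sentence as ONE word,
# so only the very end of the string is a candidate for suffix stripping.
whole = ps.stem(" ".join(tokens))

# Stemming token by token applies the suffix rules to every word.
per_token = [ps.stem(token) for token in tokens]

print(whole)      # "industries" is still intact in here
print(per_token)  # the second item is now 'industri'
```

If you need a single string again afterwards, call `" ".join(per_token)` only after the stemming step.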
Edit: Comment Response. I ran the following - and it seemed to stem correctly:
>>> from nltk.stem import PorterStemmer
>>> ps = PorterStemmer()
>>> data0 = '''<Your Data0 string>'''
>>> words = data0.split(" ")
>>> stemmed_words = map(ps.stem, words)
>>> print(list(stemmed_words)) # list cast needed because of 'map'
[..., 'industri', ..., 'diversifi']
I don't think there is a stemmer that can be applied straight to text, but you can wrap it in your own function that takes 'ps' and the text:
def my_stem(text, stemmer):
    words = text.split(" ")
    # map needs the bound method; the stemmer object itself is not callable
    stemmed_words = map(stemmer.stem, words)
    result = " ".join(list(stemmed_words))
    return result
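For completeness, a usage sketch of such a wrapper. `stem_text` is a hypothetical name of my own, and it splits on any run of whitespace with `split()` rather than on single spaces, which is an assumption about what you want for text with newlines or double spaces.

```python
from nltk.stem import PorterStemmer


def stem_text(text, stemmer):
    """Stem every whitespace-separated token and rejoin with single spaces."""
    # split() with no argument collapses runs of spaces, tabs and newlines.
    return " ".join(stemmer.stem(word) for word in text.split())


ps = PorterStemmer()
print(stem_text("industries  diversify", ps))  # industri diversifi
```

Note that punctuation stuck to a word (for example `oil.`) is stemmed as part of that word here, so tokenizing with `word_tokenize` first, as in the question, is probably the better choice for real text.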