I have to remove stop words from a text file containing 50K tweets. When I run this code, it removes the stopwords successfully, but it also removes the whitespace. I want to keep the whitespace in the text.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import codecs
import nltk

stopset = set(stopwords.words('english'))
writeFile = codecs.open("outputfile", "w", encoding='utf-8')
with codecs.open("inputfile", "r", encoding='utf-8') as f:
    line = f.read()
    tokens = nltk.word_tokenize(line)
    tokens = [w for w in tokens if not w in stopset]
    for token in tokens:
        writeFile.write(token)
When you write, write whitespace where you want whitespace. In your concrete case, a newline after each token would seem suitable, since you are killing all other formatting already. Using print instead of write does that without requiring you to mark up with an explicit newline:
from __future__ import print_function  # if you're on Python 2
# ...
for token in tokens:
    print(token, file=writeFile)
Alternatively, if you want spaces instead of newlines, put spaces. If you have a limited number of tokens, you could just
print(' '.join(tokens), file=writeFile)
but this will eat up a gob of memory to join the string together before printing, so a loop over the tokens would be more economical. But because you are processing a line at a time, joining is probably good enough, and will get you the tokens from one input line together on one output line.
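To make the per-line join concrete, here is a minimal sketch. The tiny stopset and str.split() tokenizer are stand-ins so it runs without NLTK; in your real code you would keep stopwords.words('english') and nltk.word_tokenize:

```python
# Minimal sketch: remove stopwords per line, rejoin the survivors with
# spaces so each input line stays on its own output line.
stopset = {'the', 'a', 'is', 'to'}  # stand-in for stopwords.words('english')

def filter_line(line, stopset):
    """Drop stopwords from one line, rejoining survivors with spaces."""
    tokens = line.split()  # stand-in for nltk.word_tokenize(line)
    return ' '.join(w for w in tokens if w not in stopset)

lines = ["this is a tweet", "the cat sat"]
filtered = [filter_line(line, stopset) for line in lines]
# filtered == ["this tweet", "cat sat"]
```

The key point is that join puts the separator only between tokens, so you get no stray leading or trailing space on the line.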
If you have a large amount of tokens per line, and want to loop over them for memory efficiency, a common idiom is to declare a separator which is initially empty:
sep = ''
for token in tokens:
    writeFile.write('{}{}'.format(sep, token))  # str.format(): py >= 2.6
    sep = ' '
writeFile.write('\n')
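The separator idiom above can be demonstrated against an in-memory buffer; io.StringIO stands in for writeFile here so the output is easy to inspect:

```python
import io

def write_tokens(tokens, out):
    """Write tokens space-separated to a file-like object, then a newline.

    sep starts empty, so no space is written before the first token;
    after that it becomes ' ', putting one space between each pair.
    """
    sep = ''
    for token in tokens:
        out.write('{}{}'.format(sep, token))
        sep = ' '
    out.write('\n')

buf = io.StringIO()
write_tokens(['this', 'tweet'], buf)
# buf.getvalue() == 'this tweet\n'
```

Because each token is written as soon as it is seen, memory use stays constant regardless of how many tokens the line holds.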