I want to loop through a local folder containing a couple of thousand text files, remove the stop words, and save the results in a sub-folder. My code loops through all the files, but writes everything to ONE new file. I need the files kept separate, as they were, with exactly the same filenames, just without the stop words. What am I doing wrong?
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import glob
import os
import codecs
stop_words = set(stopwords.words('english'))
for afile in glob.glob("*.txt"):
    file1 = codecs.open(afile, encoding='utf-8')
    line = file1.read()
    words = word_tokenize(line)
    words_without_stop_words = [word for word in words if word not in stop_words]
    new_words = " ".join(words_without_stop_words).strip()
    appendFile = open('subfolder/file1.txt','w', encoding='utf-8')
    appendFile.write(new_words)
    appendFile.close()
I see that the filename will always be "file1" because of the hard-coded path in the open() call, but I just can't get my head around glob (if glob is even the right tool?).
Quick Solution:
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import glob
import os
import codecs
stop_words = set(stopwords.words('english'))
for afile in glob.glob("*.txt"):
    file1 = codecs.open(afile, encoding='utf-8')
    line = file1.read()
    words = word_tokenize(line)
    words_without_stop_words = [word for word in words if word not in stop_words]
    new_words = " ".join(words_without_stop_words).strip()
    # getSubfolder() and getFilename() are placeholders you still have to write
    subfolder = getSubfolder(afile)
    filename = getFilename(afile)
    appendFile = open('{}/{}.txt'.format(subfolder,filename),'w', encoding='utf-8')
    appendFile.write(new_words)
    appendFile.close()
I've never worked with glob or codecs, but I believe your problem lies in the last three lines of your code. You use a constant string ('subfolder/file1.txt') as the output path, so every iteration of the loop opens that same file in 'w' mode and overwrites it. That's why your results land in one file. I replaced the target path with two variables, which I get from the functions getSubfolder() and getFilename(). These are placeholders: you have to implement them yourself so that they return the folder and filename you need.
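One possible way to fill in the two placeholders, as a minimal sketch. It assumes the output folder has a fixed name ("subfolder" here, which is my choice and not something from your question) and that getFilename() should return the name without its extension, because the format string above appends ".txt" again.

```python
import os

def getSubfolder(afile, target="subfolder"):
    """Return the output folder for afile, creating it if needed.

    'target' is an assumed folder name, placed next to the input file.
    """
    folder = os.path.join(os.path.dirname(afile), target)
    os.makedirs(folder, exist_ok=True)
    return folder

def getFilename(afile):
    """Return the file name without its directory and without its extension."""
    return os.path.splitext(os.path.basename(afile))[0]
```

With glob.glob("*.txt") the paths have no directory part, so getSubfolder() simply returns "subfolder" relative to the current working directory.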
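If I understand your goal correctly, the filename stays the same and only the folder changes. In that case you can use the line below. Note that afile already ends in .txt, so the extension must not be appended again, and that the folder mysubfolder has to exist before you write to it:
appendFile = open('{}/{}'.format('mysubfolder', afile), 'w', encoding='utf-8')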
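Putting it together, the whole loop could look roughly like the sketch below. The function name remove_stop_words_from_folder and its parameters are made up for illustration. The tokenizer is passed in as an argument so the example runs without NLTK data; by default it just splits on whitespace, which is not what word_tokenize does. The stop-word comparison is case-sensitive, as in your original code.

```python
import glob
import os

def remove_stop_words_from_folder(src_dir, dst_name, stop_words, tokenize=str.split):
    """Write a stop-word-free copy of every .txt file in src_dir
    into src_dir/dst_name, keeping the original file names."""
    dst_dir = os.path.join(src_dir, dst_name)
    os.makedirs(dst_dir, exist_ok=True)  # create the sub-folder if it is missing
    written = []
    # The pattern is not recursive, so files already in dst_dir are not picked up
    for path in glob.glob(os.path.join(src_dir, "*.txt")):
        with open(path, encoding="utf-8") as infile:
            words = tokenize(infile.read())
        kept = [w for w in words if w not in stop_words]
        # Same file name as the input, different folder
        target = os.path.join(dst_dir, os.path.basename(path))
        with open(target, "w", encoding="utf-8") as outfile:
            outfile.write(" ".join(kept))
        written.append(target)
    return written
```

In your case the call would presumably be remove_stop_words_from_folder(".", "subfolder", set(stopwords.words('english')), word_tokenize). The with blocks close each file automatically, so no explicit close() is needed.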
Solution while learning:
I recommend taking a look at https://github.com/inducer/pudb and stepping through every iteration of your loop. That way you will see what Python is doing and what value each variable has at any given point.