Tags: python, python-3.x, glob

Loop through files and save them separately


I want to loop through a local folder with a couple thousand text files, remove the stop words, and save the files in a sub-folder. My code loops through all the files but writes every text file to ONE new file. I need the files kept separate, as they were, and with the exact same filenames, just without the stop words. What am I doing wrong?

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import glob
import os
import codecs

stop_words = set(stopwords.words('english'))

for afile in glob.glob("*.txt"):
    file1 = codecs.open(afile, encoding='utf-8')
    line = file1.read()
    words = word_tokenize(line)
    words_without_stop_words = [word for word in words if word not in stop_words]
    new_words = " ".join(words_without_stop_words).strip()
    appendFile = open('subfolder/file1.txt','w', encoding='utf-8')
    appendFile.write(new_words)
    appendFile.close()

I see that the filename will always be "file1.txt" (the open() call inside the loop) - I just can't get my head around glob (if glob is even the solution?).


Solution

  • Quick Solution:

    import io
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    import glob
    import os
    import codecs
    
    stop_words = set(stopwords.words('english'))
    
    for afile in glob.glob("*.txt"):
        file1 = codecs.open(afile, encoding='utf-8')
        line = file1.read()
        words = word_tokenize(line)
        words_without_stop_words = [word for word in words if word not in stop_words]
        new_words = " ".join(words_without_stop_words).strip()
    
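        # getSubfolder() and getFilename() are placeholders, not library
        # functions; you have to implement them yourself.
        subfolder = getSubfolder(afile)
        filename = getFilename(afile)
        appendFile = open('{}/{}.txt'.format(subfolder,filename),'w', encoding='utf-8')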
        appendFile.write(new_words)
        appendFile.close()
    

    I've never worked with glob or codecs, but I believe your problem lies in the last three lines of your code. You use a constant string ('subfolder/file1.txt') as the target path, so every iteration overwrites the same file; that's why your results land in one file. I replaced the target path with two variables, which I get from the functions getSubfolder() and getFilename(). You have to implement these functions yourself to produce the path you need.
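
The two helper functions are left unimplemented in the answer. A minimal sketch of what they might look like, assuming the output folder sits next to the input file and that the helper returns the name without its extension (both are assumptions, as is the default folder name "subfolder"):

```python
import os


def getSubfolder(path, name="subfolder"):
    """Hypothetical helper: output folder located next to the input file."""
    return os.path.join(os.path.dirname(path), name)


def getFilename(path):
    """Hypothetical helper: file name without directory or extension."""
    return os.path.splitext(os.path.basename(path))[0]
```

With these definitions, an input of "report.txt" would give the target path "subfolder/report.txt" once the format string in the loop adds the extension back. The folder itself still has to exist before open() is called.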

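    If I understand your goal correctly, the filename stays the same, just in a different folder. In that case you can use the line below. Note that afile already ends in .txt, so no extension is appended, and that the folder must already exist:

        appendFile = open('{}/{}'.format('mysubfolder',afile),'w', encoding='utf-8')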
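
Putting the pieces together, the whole loop could be sketched roughly as follows. The function name, its parameters, and the default whitespace tokenizer are assumptions made for illustration; the idea is only that each output path is derived from the input file's own name, and that the target folder is created if it is missing.

```python
import glob
import os


def strip_stop_words_in_folder(src_dir, dst_dir, stop_words, tokenize=str.split):
    """Write a stop-word-free copy of every .txt file in src_dir to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)  # create the target folder if needed
    written = []
    for src_path in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
        with open(src_path, encoding="utf-8") as src:
            words = tokenize(src.read())
        # Same case-sensitive filter as in the question's code
        kept = [word for word in words if word not in stop_words]
        # Reuse the original file name so each input gets its own output file
        dst_path = os.path.join(dst_dir, os.path.basename(src_path))
        with open(dst_path, "w", encoding="utf-8") as dst:
            dst.write(" ".join(kept))
        written.append(dst_path)
    return written
```

To match the question's setup, this could be called as strip_stop_words_in_folder(".", "subfolder", set(stopwords.words('english')), word_tokenize), using the NLTK imports from the original code.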
    

    Solution while learning:

    I recommend taking a look at pudb (https://github.com/inducer/pudb) and stepping through every iteration of your loop. That way you can see what Python does and which value each variable holds at any point in time.
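
One way to step through the loop is sketched below. It uses Python's built-in breakpoint() (available since Python 3.7), which opens pdb by default; setting the environment variable PYTHONBREAKPOINT=pudb.set_trace should route it to pudb instead. The function name and the STEP_DEBUG switch are assumptions made for this example, and the whitespace split stands in for word_tokenize.

```python
import os


def remove_stop_words(text, stop_words):
    """Return text without stop words, optionally pausing on every word."""
    kept = []
    for word in text.split():
        # STEP_DEBUG is a hypothetical switch: the debugger only opens
        # when this environment variable is set to a non-empty value.
        if os.environ.get("STEP_DEBUG"):
            breakpoint()
        if word not in stop_words:
            kept.append(word)
    return " ".join(kept)
```

Inside the debugger, inspecting word and kept on each pass shows how the result is built up one token at a time.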