python list loops python-docx data-extraction

How to extract text from docx files contained in different folders


I am writing code to extract text from Word documents with the .docx extension. I have a big folder named "EXTRACTION" that contains several sub-folders (for example: folder 1, 2, 3, etc.), and each sub-folder contains 2 to 10 docx documents. I want to extract the text from each of those files and put it in a new txt file.

I started writing this code, but it is not working (this is the second version of the code):

import os
import glob
import docx



print(os.getcwd())

dirs = dirs = glob.glob('fi*')
path = os.getcwd()

for directory in dirs:
    for filename in directory:
        if filename.endswith(".docx") or filename.endswith(".doc"):
            document = docx.Document(filename)
            #docText = []
            with open('your_file.txt', 'w') as f:
                for paragraph in document.paragraphs:
                    if paragraph.text:
                        #docText.append(paragraph.text)
                        f.write("%s\n" % paragraph.text)

This code does not seem to work. Could you help me improve it?



Solution

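  • In your code, the loop "for filename in directory" iterates over the characters of the directory name, not over the files inside it. You can use glob.glob with recursive=True to get a list of all files in the subdirectories: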

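    import glob
    import docx

    # Collect every .docx file under the main folder, including subfolders.
    # python-docx cannot read legacy .doc files, so only *.docx is matched.
    files = glob.glob('/path/to/mainfolder/**/*.docx', recursive=True)

    # Open the output once so each document is appended, not overwritten.
    with open('your_file.txt', 'w') as f:
        for file in files:
            document = docx.Document(file)
            for paragraph in document.paragraphs:
                if paragraph.text:
                    f.write("%s\n" % paragraph.text)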
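    If you would rather get one txt file per document instead of a single combined file, something along these lines might work. This is only a sketch: the function names (find_docx_files, docx_to_txt, extract_all) are made up for illustration, it assumes python-docx is installed, and skipping files that start with "~$" is an assumption about Word's temporary lock files that you can drop if it does not apply to you.

```python
from pathlib import Path


def find_docx_files(root):
    """Return all .docx files under root, including subfolders, sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*.docx")
        # Assumption: names starting with "~$" are Word lock files, so skip them.
        if p.is_file() and not p.name.startswith("~$")
    )


def docx_to_txt(docx_path, txt_path):
    """Write the non-empty paragraphs of one .docx file to a text file."""
    # python-docx is imported here so that file discovery works without it.
    import docx

    document = docx.Document(str(docx_path))
    with open(txt_path, "w", encoding="utf-8") as f:
        for paragraph in document.paragraphs:
            if paragraph.text:
                f.write(paragraph.text + "\n")


def extract_all(root):
    """Create a .txt file next to every .docx file found under root."""
    written = []
    for docx_path in find_docx_files(root):
        # report.docx becomes report.txt in the same sub-folder.
        txt_path = docx_path.with_suffix(".txt")
        docx_to_txt(docx_path, txt_path)
        written.append(txt_path)
    return written
```

    Calling extract_all("EXTRACTION") from the parent directory should then leave a txt file beside each docx file and return the list of paths it wrote. Note that this only reads body paragraphs, the same as the code above, so text inside tables is not included.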