python list loops python-docx data-extraction

How to extract text from docx files contained in different folders

I am writing code to extract text from Word documents with the .docx extension. I have a big folder named "EXTRACTION" that contains different sub-folders (for example: folder 1, 2, 3, etc.), and each sub-folder contains from 2 to 10 .docx documents. I want to extract the text from each of those files and put it in a new txt file.

I started writing this code but it is not working (second version of the code):

import os
import glob
import docx



print(os.getcwd())

dirs = dirs = glob.glob('fi*')
path = os.getcwd()

for directory in dirs:
    for filename in directory:
        if filename.endswith(".docx") or filename.endswith(".doc"):
            document = docx.Document(filename)
            #docText = []
            with open('your_file.txt', 'w') as f:
                for paragraph in document.paragraphs:
                    if paragraph.text:
                        #docText.append(paragraph.text)
                        f.write("%s\n" % paragraph.text)

This code does not seem to work. Could you help me improve it?

Solution

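You can use glob.glob with recursive=True to get a list of all the files in the subdirectories. In your code, `for filename in directory` iterates over the characters of the directory name (a string), not over the files inside that directory, so no filename ever ends with ".docx".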

import glob
import docx

# python-docx only reads .docx, so legacy .doc files are not matched here
files = glob.glob('/path/to/mainfolder/**/*.docx', recursive=True)

# open the output once so every document is written to the same file
with open('your_file.txt', 'w', encoding='utf-8') as f:
    for file in files:
        document = docx.Document(file)
        for paragraph in document.paragraphs:
            if paragraph.text:
                f.write("%s\n" % paragraph.text)
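If you would rather have one txt file per document instead of a single combined file, something along these lines might work. It is only a sketch: the helper names `find_docx_files` and `docx_to_txt` are made up for illustration, and it assumes each txt file should be written next to its source .docx with the same base name. The `docx` import sits inside the second function so that the file search can be tried on its own.

```python
from pathlib import Path


def find_docx_files(root):
    """Return all .docx files under root, at any depth, in sorted order."""
    return sorted(p for p in Path(root).rglob("*.docx") if p.is_file())


def docx_to_txt(docx_path):
    """Write the non-empty paragraphs of one .docx to a sibling .txt file."""
    import docx  # python-docx; imported here so the search works without it

    document = docx.Document(str(docx_path))
    # Assumption: the output goes next to the source, e.g. a.docx -> a.txt
    txt_path = Path(docx_path).with_suffix(".txt")
    with open(txt_path, "w", encoding="utf-8") as f:
        for paragraph in document.paragraphs:
            if paragraph.text:
                f.write(paragraph.text + "\n")
    return txt_path


if __name__ == "__main__":
    # "EXTRACTION" is the top-level folder name from the question
    for path in find_docx_files("EXTRACTION"):
        print(docx_to_txt(path))
```

Note that `document.paragraphs` only covers body paragraphs, so text inside tables would need separate handling.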