I am writing code to extract text from Word documents with the .docx extension. I have a big folder named "EXTRACTION" that contains different sub-folders (for example: folder 1, 2, 3, etc.), and each sub-folder contains 2 to 10 .docx documents. I want to extract the text from each of those files and put it in a new txt file.
I started writing this code, but it is not working (second version of the code):
import os
import glob
import docx

print(os.getcwd())
dirs = glob.glob('fi*')
path = os.getcwd()
for directory in dirs:
    for filename in directory:
        if filename.endswith(".docx") or filename.endswith(".doc"):
            document = docx.Document(filename)
            #docText = []
            with open('your_file.txt', 'w') as f:
                for paragraph in document.paragraphs:
                    if paragraph.text:
                        #docText.append(paragraph.text)
                        f.write("%s\n" % paragraph.text)
This code does not seem to work. Could you help me improve it?
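You can use glob.glob with recursive=True to get a list of all files in the subdirectories. In your code, "for filename in directory" loops over the characters of the folder name rather than the files inside it, and reopening your_file.txt in 'w' mode for each document would overwrite the previous output.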
import glob
import docx

# python-docx only reads .docx files, so legacy .doc files are not matched here
files = glob.glob('/path/to/mainfolder/**/*.docx', recursive=True)
# Open the output file once so every document is written to the same file
with open('your_file.txt', 'w') as f:
    for file in files:
        document = docx.Document(file)
        for paragraph in document.paragraphs:
            if paragraph.text:
                f.write("%s\n" % paragraph.text)
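If you would rather get one txt file per document instead of a single combined file, something along these lines might work. It is only a sketch: the helper names (find_docx_files, extract_text, convert_folder) are made up for illustration, it assumes python-docx is installed, and it assumes each txt file should sit next to its .docx file with the same base name. It also skips the "~$..." lock files that Word leaves behind while a document is open.

```python
from pathlib import Path


def find_docx_files(root):
    """Return every .docx file below root, sorted, skipping Word lock files (~$...)."""
    return sorted(
        p for p in Path(root).rglob("*.docx")
        if p.is_file() and not p.name.startswith("~$")
    )


def extract_text(docx_path):
    """Return the non-empty paragraphs of one document, one per line."""
    # Imported here so find_docx_files can be used without python-docx installed.
    import docx

    document = docx.Document(str(docx_path))
    return "\n".join(p.text for p in document.paragraphs if p.text)


def convert_folder(root):
    """Write a .txt file next to each .docx file and return the paths written."""
    written = []
    for docx_path in find_docx_files(root):
        # Assumption: same folder and base name, e.g. 1/report.docx -> 1/report.txt
        txt_path = docx_path.with_suffix(".txt")
        txt_path.write_text(extract_text(docx_path) + "\n", encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling convert_folder("EXTRACTION") from the folder that contains EXTRACTION should then walk all sub-folders. Note that document.paragraphs only covers body paragraphs, so text inside tables, headers, or footers would need extra handling.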