
Search all docx files with python-docx in a directory (batch)


I have a bunch of Word docx files that have the same embedded Excel table. I am trying to extract the same cells from several files.

I figured out how to hard-code this for one file:

from docx import Document

document = Document(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx\006-087-003.docx")
table = document.tables[0]
Project_cell = table.rows[2].cells[2]
paragraph = Project_cell.paragraphs[0]
Project = paragraph.text

print(Project)

But how do I batch this? I tried some variations on listdir, but they are not working for me and I am too green to get there on my own.


Solution

  • How you loop over the files depends on how they are organized. Are all of the files in a single folder? Are there more than just .docx files?

    To cover both cases, we'll assume that there are subdirectories and that other files are mingled with your .docx files. For this, we'll use os.walk() and os.path.splitext():

    import os
    
    from docx import Document
    
    # First, we'll create an empty list to hold the path to all of your docx files
    document_list = []       
    
    # Now, we loop through every file in the folder "G:\GIS\DESIGN\ROW\ROW_Files\Docx"
    # (and all its subfolders) using os.walk().  You could alternatively use os.listdir()
    # to get a list of files.  That is recommended, and simpler, if all files are
    # in the same folder.  Consider that change a small challenge for developing your skills!
    for path, subdirs, files in os.walk(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx"): 
        for name in files:
            # For each file we find, we need to ensure it is a .docx file before adding
            #  it to our list
            if os.path.splitext(name)[1].lower() == ".docx":
                document_list.append(os.path.join(path, name))
    
    # Now create a loop that goes over each file path in document_list, replacing your 
    # hard-coded path with the variable.
    for document_path in document_list:
        document = Document(document_path)        # Change the document being loaded each loop
        table = document.tables[0]
        project_cell = table.rows[2].cells[2]
        paragraph = project_cell.paragraphs[0]
        project = paragraph.text
    
        print(project)
    

    For additional reading, here is the documentation on os.listdir().
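
If all of your files do sit in one folder, the os.listdir() variant could look roughly like the sketch below. The function name list_docx_files is made up for illustration. Skipping names that start with "~$" is an extra assumption that is not part of the code above: Word leaves such lock files next to a document while it is open, and they cannot be loaded as documents.

```python
import os


def list_docx_files(folder):
    """Return the sorted full paths of the .docx files directly inside folder.

    Hypothetical helper: subfolders are not searched.
    """
    paths = []
    for name in os.listdir(folder):
        # Assumption: skip Word lock files such as "~$report.docx".
        if name.startswith("~$"):
            continue
        full_path = os.path.join(folder, name)
        extension = os.path.splitext(name)[1].lower()
        # Keep only regular files whose extension is .docx (any letter case).
        if extension == ".docx" and os.path.isfile(full_path):
            paths.append(full_path)
    return sorted(paths)
```

You would then build the list with something like document_list = list_docx_files(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx") and loop over it as before.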

    Also, it would be best to put your code into a reusable function, but that's another challenge for you!
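
For anyone who wants a starting point for that function, here is one possible sketch, not the only way to structure it. The names read_cell_text and read_cell_from_files and their parameters are invented for this example. The default indices simply mirror the hard-coded table, row and cell from the question.

```python
def read_cell_text(document, table_index=0, row_index=2, cell_index=2):
    """Return the text of the first paragraph in one table cell.

    document is an already-loaded python-docx Document. The defaults
    mirror the indices hard-coded in the question.
    """
    table = document.tables[table_index]
    cell = table.rows[row_index].cells[cell_index]
    return cell.paragraphs[0].text


def read_cell_from_files(document_paths, **cell_location):
    """Return a dict mapping each path to the text of the chosen cell."""
    # Imported here so read_cell_text can be used (and tested) on its own.
    from docx import Document

    results = {}
    for document_path in document_paths:
        document = Document(document_path)
        results[document_path] = read_cell_text(document, **cell_location)
    return results
```

Calling read_cell_from_files(document_list) would give you a dictionary keyed by file path, and you can pass row_index or cell_index if you later need a different cell. Note that a file whose first table is missing or smaller than expected will raise an IndexError here, so you may want to catch that per file.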