Tags: python, regex, python-docx

Parsing docx files in Python


I’m trying to read headings from multiple docx files. Annoyingly, these headings do not have an identifiable paragraph style. All paragraphs have ‘normal’ paragraph styling so I am using regex. The headings are formatted in bold and are structured as follows:

A. Cat

B. Dog

C. Pig

D. Fox

If there are more than 26 headings in a file, the later headings are prefixed with ‘AA.’, ‘BB.’, etc.

I have the following code, which kind of works except any heading preceded by ‘D.’ prints twice, e.g. [Cat, Dog, Pig, Fox, Fox]

import os
from docx import Document
import re

directory = input("Copy and paste the location of the files.\n").lower()

for file in os.listdir(directory):

    document = Document(directory+file)

    head1s = []

    for paragraph in document.paragraphs:

        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)

        for run in paragraph.runs:

            if run.bold:

                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)

    print(head1s)

Can anyone tell me if there is something wrong with the code that is causing this to happen? As far as I can tell, there is nothing unique about the formatting or structure of these particular headings in the Word files.


Solution

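  • What's happening is that the append sits inside the for run in paragraph.runs: loop, so a matching heading is added once for every bold run in its paragraph. The regex is re-evaluated for each paragraph, so a stale head1 can't leak in from an earlier one; the duplicate has to come from the 'D. Fox' paragraph itself.

    I think the for run in paragraph.runs: loop is iterating twice over that paragraph. Word often splits text that looks uniform into several runs (for example after an edit or a spell-check flag), so there may be a second bold "run" that is there but invisible.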

    Perhaps adding a break when the first match is found is enough to prevent the second run triggering?

    for file in os.listdir(directory):

        document = Document(directory+file)

        head1s = []

        for paragraph in document.paragraphs:

            heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)

            for run in paragraph.runs:

                if run.bold:

                    if heading:
                        head1 = paragraph.text
                        head1 = head1.split('.')[1]
                        head1s.append(head1)
                        # this break stops the run loop if a match was found.
                        break

        print(head1s)
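
An alternative to the break, as a rough sketch rather than a tested drop-in: decide once per paragraph with any(), so the number of bold runs no longer matters. The names extract_headings and fake_paragraph are made up for illustration. fake_paragraph only mimics the .text, .runs and .bold attributes of python-docx objects so the example runs without a Word file; with real files you would presumably pass Document(path).paragraphs instead. The capture group is also an assumption: it takes everything after the 'A. ' prefix, so a heading that itself contains a full stop is not cut short the way split('.')[1] would cut it.

```python
import re
from types import SimpleNamespace

# One or more capital letters, a full stop, whitespace, then the heading text.
HEADING_RE = re.compile(r'^[A-Z]+\.\s+(.*)')


def extract_headings(paragraphs):
    """Collect heading text from paragraphs that match 'A. Title' and contain bold text."""
    headings = []
    for paragraph in paragraphs:
        match = HEADING_RE.match(paragraph.text)
        # any() gives one True/False per paragraph, however many runs are bold.
        # run.bold can also be None (inherited formatting), which counts as not bold here.
        if match and any(run.bold for run in paragraph.runs):
            headings.append(match.group(1).strip())
    return headings


def fake_paragraph(*runs):
    """Hypothetical stand-in for a python-docx paragraph, built from (text, bold) pairs."""
    run_objs = [SimpleNamespace(text=text, bold=bold) for text, bold in runs]
    return SimpleNamespace(text=''.join(text for text, _ in runs), runs=run_objs)


paragraphs = [
    fake_paragraph(("A. Cat", True)),
    fake_paragraph(("Some body text.", None)),
    fake_paragraph(("D. ", True), ("Fox", True)),  # one heading split across two bold runs
]

headings = extract_headings(paragraphs)
print(headings)  # ['Cat', 'Fox']
```

The 'D. Fox' stand-in has two bold runs, which is the suspected cause of the duplicate, yet it is collected only once.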