
Determine the section heading of a table with python-docx


I need to find (and extract) the section heading of certain tables in a DOCX file.

The problem is that empty paragraphs, or even other tables, may come before a relevant table, so I need to iterate backwards from the table until I reach a heading of any level.

Document
    Heading
    (paragraphs)
    Table 1
        Subheading
        (paragraphs)
        (irrelevant table)
        Table 2

My starting point is as follows:

from docx import Document
doc = Document(infile)
for i, table in enumerate(doc.tables):  
    for previous paragraph:  # <=== How can I iterate backwards?
        if paragraph.style.name.startswith('Heading'):
            heading = paragraph.text
            break

Thanks in advance!


Solution

  • You should use the Document object's iter_inner_content() method.

    Documented here: https://python-docx.readthedocs.io/en/latest/api/document.html#docx.document.Document.iter_inner_content

    Document.iter_inner_content() yields the paragraphs and tables of the document body in the order they appear. Instead of searching backwards from each table, you can iterate forwards once: keep the text of the most recent heading in a variable, update it each time you reach a new heading paragraph, and read it when you reach a table.
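
    A minimal sketch of that approach, under a few assumptions: the installed python-docx is recent enough to provide Document.iter_inner_content(), and headings use styles whose names start with 'Heading', as in the question. The function names tables_with_headings and table_headings_in_file are hypothetical, chosen only for illustration. Tables are told apart from paragraphs by their rows attribute, which paragraphs do not have.

```python
def tables_with_headings(blocks):
    """Yield (heading_text, table) for every table in `blocks`.

    `blocks` is an iterable of paragraphs and tables in document order,
    for example the result of Document.iter_inner_content().
    heading_text is None for tables that appear before the first heading.
    """
    current_heading = None
    for block in blocks:
        if hasattr(block, "rows"):
            # Tables expose .rows; paragraphs do not.
            yield current_heading, block
        elif (block.style.name or "").startswith("Heading"):
            # Assumption: heading styles are named 'Heading 1', 'Heading 2', ...
            current_heading = block.text
        # Any other paragraph (empty or body text) is skipped.


def table_headings_in_file(path):
    """Return the heading text for each top-level table in a DOCX file."""
    from docx import Document  # python-docx; imported lazily on purpose

    doc = Document(path)
    return [heading for heading, _table in
            tables_with_headings(doc.iter_inner_content())]
```

    Because the heading variable always holds the last heading seen, empty paragraphs and irrelevant tables between a heading and the table of interest need no special handling. To pick out only certain tables, filter the (heading, table) pairs that the generator yields.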