Tags: python, ms-word, python-docx

Processing objects in order in docx


I want to process objects in the order they appear in a Word document. The objects I have encountered are paragraphs, text in paragraphs, runs in paragraphs, text in runs, tables, and paragraphs in a table's cells. So far I have two useful programs. The first goes through the document's paragraphs and collects the text of each paragraph, stored in a list indexed by [paragraph #]. The same program can also gather the text from runs, stored in a 2D list indexed by [paragraph #][run #], but I have not found the runs more useful than the whole text of the paragraph. My second program goes through the whole document and finds tables. When it finds a table, it goes through it by row, then cell, then the paragraphs in the cell.

These seem like great building blocks for my goal. I would like to gather text in order, as if a blinking text cursor were being moved through the document by a person holding down the right arrow key. As the cursor moves over objects, it stores them under several indexes labeling the number and the type of each object.

Say I have the sub-functions paragraph_read and table_read, and the document has this order of objects: paragraph, paragraph, table, paragraph. I'd like to go through these and call my sub-functions in this order: paragraph_read, paragraph_read, table_read, paragraph_read.

I would like to know if my program can move through a document object by object like a cursor swiping right.

Help is greatly appreciated. Thanks.

-Chris


Solution

  • UPDATE

    There are some new methods in python-docx that take care of much of the detail here:

    Document.iter_inner_content() - provides access to the Paragraph and Table objects in a document, in document order:

    for block_item in document.iter_inner_content():
        if isinstance(block_item, Paragraph):
            ...  # process paragraph
        elif isinstance(block_item, Table):
            ...  # process table
    

    A table cell is also a block-item container and has the same method. This allows recursing into tables if you want that.

    Header and Footer objects are also block-item containers and have this method.

    A Section is not a block-item container per-se, but does have this method for when you want to iterate through the document section-by-section.


    If your version of python-docx does not have these methods, you can add this function to your code somewhere convenient:

    from docx import Document  # factory function that opens a .docx file
    from docx.document import Document as _Document  # class, for isinstance()
    from docx.oxml.table import CT_Tbl
    from docx.oxml.text.paragraph import CT_P
    from docx.table import _Cell, Table
    from docx.text.paragraph import Paragraph


    def iter_block_items(parent):
        """
        Yield each paragraph and table child within *parent*, in document
        order. Each returned value is an instance of either Table or
        Paragraph. *parent* would most commonly be a reference to a main
        Document object, but also works for a _Cell object, which itself can
        contain paragraphs and tables.
        """
        if isinstance(parent, _Document):
            parent_elm = parent.element.body
        elif isinstance(parent, _Cell):
            parent_elm = parent._tc
        else:
            raise ValueError(f"unsupported parent type: {type(parent).__name__}")
    
        for child in parent_elm.iterchildren():
            if isinstance(child, CT_P):
                yield Paragraph(child, parent)
            elif isinstance(child, CT_Tbl):
                yield Table(child, parent)
    

    Then you use it like this:

    document = Document('my_document.docx')
    
    for block_item in iter_block_items(document):
        if isinstance(block_item, Paragraph):
            do_paragraph_thing(paragraph=block_item)
        elif isinstance(block_item, Table):
            do_table_thing(table=block_item)
        else:
            # This branch would only be reached on an unforeseen error.
            # Raise an exception here, or use `pass` to ignore the item.
            raise TypeError(f"unexpected block item: {block_item!r}")
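
For the curious, iterating in document order is possible because paragraphs and tables are sibling elements inside the document body, which is what `parent_elm.iterchildren()` walks in `iter_block_items()` above. The standard-library sketch below shows the same idea without python-docx. It assumes the main part sits at the conventional `word/document.xml` path, and the function name `iter_body_children` and the hand-built package are purely illustrative (the package contains only the one part this sketch reads, so Word itself would not open it).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def iter_body_children(docx_file):
    """Yield (kind, text) for each top-level paragraph or table, in file order."""
    with zipfile.ZipFile(docx_file) as archive:
        # Assumption: the main document part lives at the conventional path.
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(f"{{{W_NS}}}body")
    for child in body:
        # Join every w:t descendant; for a table this flattens all cells.
        text = "".join(t.text or "" for t in child.iter(f"{{{W_NS}}}t"))
        if child.tag == f"{{{W_NS}}}p":
            yield "paragraph", text
        elif child.tag == f"{{{W_NS}}}tbl":
            yield "table", text


# Hand-built body in the order from the question:
# paragraph, paragraph, table, paragraph.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>one</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>two</w:t></w:r></w:p>"
    "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
    "<w:p><w:r><w:t>three</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", DOCUMENT_XML)
buffer.seek(0)

blocks = list(iter_body_children(buffer))
```

In practice the python-docx objects are far more convenient, since they give you `Paragraph` and `Table` instances with styles, runs, and cells rather than raw XML.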