
How can I extract index marker data from a docx document using python-docx?


Given a simple paragraph block, I would like to extract index marker data from it.

Simple code like this:

print(block.text)

for run in block.runs:
    print(run)

would print out the paragraph text and a list of its runs, one of which (as I understand it) contains a special XE (Index Entry) field.

This is a test.
<docx.text.run.Run object at 0x7f800f369c50>
<docx.text.run.Run object at 0x7f800f369da0>
<docx.text.run.Run object at 0x7f800f369dd8>
<docx.text.run.Run object at 0x7f800f369c18>
<docx.text.run.Run object at 0x7f800f369e48>
<docx.text.run.Run object at 0x7f800f369eb8>
<docx.text.run.Run object at 0x7f800f369f28>

I need to extract the data from the run that contains the index marker, along with that run's position in the paragraph (i.e. its character offset).

Is there an API I missed in the python-docx library that might help? Or should I parse the raw XML? How can I get the raw XML of the paragraph?

Thanks!!


Solution

  • You can drop down to the lxml/oxml layer for this.

    You would need some sort of "outer" loop to keep track of the current offset. A generator function might be convenient for that.

    def iter_xe_runs_with_offsets(paragraph):
        """Generate (run, run_idx, text_offset) triples from `paragraph`."""
        text_offset = 0
        for run_idx, run in enumerate(paragraph.runs):
            if contains_index_marker(run):
                yield (run, run_idx, text_offset)
            text_offset += len(run.text)
    

    Then a processing method can use that to do the needful:

    def process_paragraph(paragraph):
        for run, run_idx, text_offset in iter_xe_runs_with_offsets(paragraph):
            # ... do the needful; printing the position is only a placeholder ...
            print(run_idx, text_offset)
    

    And you need a supporting helper to tell whether the run has the index marker. This would use lxml.etree._Element methods on the run._r run-element object.

    def contains_index_marker(run):
        """Return True if `run` is marked as index entry."""
        r = run._r
        # ... use lxml on `r` to identify presence of "index marker"
        # the code to do that depends on whether it is an attribute or
        # child element.
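
One possible way to fill in that helper, as a sketch rather than a definitive implementation: it assumes Word stored the index entry as a complex field, so the instruction (for example ` XE "apple" `) sits in a `w:instrText` child of a single run. If the instruction is split across several runs, or stored as a `w:fldSimple` element, this would need adapting. The name `index_entry_text` and the regular expression are my own, not part of python-docx. To check how your document actually stores the marker, `print(paragraph._p.xml)` shows the raw XML of a paragraph.

```python
import re
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
INSTR_TEXT = "{%s}instrText" % W_NS  # Clark notation works in lxml and ElementTree

# Assumed shape of an XE instruction: ` XE "entry text" ...switches... `
XE_RE = re.compile(r'^\s*XE\s+"([^"]*)"')


def index_entry_text(r):
    """Return the XE entry text held in run element `r`, or None.

    `r` is a `w:r` element: `run._r` in python-docx, or an ElementTree element.
    """
    # Join the text of all direct w:instrText children of this run.
    instr = "".join(el.text or "" for el in r.findall(INSTR_TEXT))
    match = XE_RE.match(instr)
    return match.group(1) if match else None


def contains_index_marker(run):
    """Return True if `run` (a python-docx Run) carries an XE instruction."""
    return index_entry_text(run._r) is not None


# Stand-alone check on a hand-written run element; no .docx file is needed.
sample = ET.fromstring(
    '<w:r xmlns:w="%s"><w:instrText xml:space="preserve"> XE "apple" '
    '</w:instrText></w:r>' % W_NS
)
print(index_entry_text(sample))  # prints: apple
```

Since `run.text` does not include field instruction text, the `text_offset` computed in the generator above still counts only visible characters.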