Search code examples
pythonpython-docx

Best way to handle images in header/footer when replacing paragraph.text? Python-docx


I have a tool that find/replaces for a bulk amount of documents. Here is the code for this:

        # Find the index of the find_text (case-insensitive)
        index = paragraph.text.lower().find(find.entryWord.lower())

        # Calculate the end of the substring to be replaced
        end_index = index + len(find.entryWord)

        # Create the modified paragraph
        modified_paragraph = (
            paragraph.text[:index]
            + paragraph.text[index:end_index]
            .lower()
            .replace(find.entryWord.lower(), replace.entryWord)
            + paragraph.text[end_index:]
        )

        paragraph.text = modified_paragraph

This works well, it is just when I have an image in the paragraph that I'm working on, that it deletes the image. This happens when I set paragraph.text = modified_paragraph.

To my question, is there any better way to handle images that are in the same paragraph as text that I'm looking to change?

Note: I know of runs within python-docx, but they are extremely inconsistent with how words get broken up so I would prefer to avoid using those if I can.


Solution

  • There are no easy answers to this problem, they all involve working at and below the run level. But a few things that would be helpful to understand:

    1. An image is enclosed within a w:drawing element and those occur as a child of run.
    2. You can iterate the "inner-content" of a run to discover Drawing objects, which are the proxy for a w:drawing element.
    3. When you call Paragraph.text, all the existing run elements are removed from the paragraph. They are not, however, automatically deleted. They just wait around for garbage-collection once the last reference to them goes out of scope.

    So a potential approach is to:

    1. detect the run that contains the drawing
    2. replace the paragraph text
    3. re-add the run containing the drawing

    This will entail working at the XML level for certain steps, so things like paragraph._p.append(run._r) if you're familiar with that sort of thing from other python-docx questions and answers.