Tags: python, docx, python-docx

Python docx - Modify runs to target specific words


I’m writing Python code that searches a docx file for certain words, for example finding the word “car” and highlighting it with a defined colour.

I’m using the python-docx module to identify and highlight the text, and I can apply the changes at the run level (run.font.highlight_color). But since MS Word stores the text in an XML file that keeps track of all the changes, the words I’m looking for can be split across different runs or be part of a long sentence. Since my final goal is to target one or more defined words, I’m struggling to get the expected result.
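
To make the problem concrete, here is a minimal sketch that uses plain strings in place of real runs. The run texts are invented for illustration, and `runs_spanned` is a hypothetical helper, not part of python-docx. It shows that a phrase can match in the joined paragraph text while no single run contains it:

```python
import re
from itertools import accumulate

# Hypothetical run texts: Word has split "train station" across three runs.
run_texts = ["Meet me at the tr", "ain stat", "ion near the car park."]
# Roughly what paragraph.text returns: the run texts joined together.
paragraph_text = "".join(run_texts)


def runs_spanned(run_texts, start, end):
    """Return indices of the runs that overlap paragraph_text[start:end]."""
    ends = list(accumulate(len(t) for t in run_texts))
    starts = [0] + ends[:-1]
    return [
        i
        for i, (s, e) in enumerate(zip(starts, ends))
        if s < end and start < e
    ]


pattern = re.compile(r"\btrain station\b")

# Searching the whole paragraph finds the phrase ...
match = pattern.search(paragraph_text)
spanned = runs_spanned(run_texts, match.start(), match.end())

# ... but searching run by run, as in the code below, finds nothing.
per_run_hits = [bool(pattern.search(t)) for t in run_texts]
```

Searching run by run therefore misses split words, and when a word does sit inside one long run, highlighting that run colours the whole sentence.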

My main idea would be to run a function that “cleans” the runs or the XML, so that my target words end up in isolated runs that can then be highlighted. However, I haven’t found any documentation about this, and I’m worried about losing font properties, styles, etc.

This is the code that I have so far:

import docx
from docx.enum.text import WD_COLOR_INDEX
import re

doc = docx.Document('demo.docx')

words = {'car': 'RED',
         'bus': 'GREEN',
         'train station': 'BLUE'}

for word, color in words.items():
    w = re.compile(fr'\b{word}\b')
    
    for par in doc.paragraphs:
        for run in par.runs:
            s = re.findall(w, run.text)
            if s:
                run.font.highlight_color = getattr(WD_COLOR_INDEX, color)

doc.save('new.docx')

Has anyone encountered the same problem, or does anyone have an idea for a different approach?

Thanks


Solution

  • This function can be used to isolate a run within a paragraph based on the match.start() and match.end() values you get from a regex match on paragraph.text. From there you can change the properties of the returned run however you like without affecting adjacent text:

    import copy
    import itertools

    from docx.text.run import Run


    def isolate_run(paragraph, start, end):
        """Return docx.text.run.Run object containing only `paragraph.text[start:end]`.
    
        Runs are split as required to produce a new run at the `start` that ends at `end`.
        Runs are unchanged if the indicated range of text already occupies its own run. The
        resulting run object is returned.
    
        `start` and `end` are as in Python slice notation. For example, the first three
        characters of the paragraph have (start, end) of (0, 3). `end` is not the index of
        the last character. These correspond to `match.start()` and `match.end()` of a regex
        match object and `s[start:end]` of Python slice notation.
        """
        rs = tuple(paragraph._p.r_lst)
    
        def advance_to_run_containing_start(start, end):
            """Return (r_idx, start, end) triple indicating start run and adjusted offsets.
    
            The start run is the run the `start` offset occurs in. The returned `start` and
            `end` values are adjusted to be relative to the start of `r_idx`.
            """
            # --- add 0 at end so `r_ends[-1] == 0` ---
            r_ends = tuple(itertools.accumulate(len(r.text) for r in rs)) + (0,)
            r_idx = 0
            while start >= r_ends[r_idx]:
                r_idx += 1
            skipped_rs_offset = r_ends[r_idx - 1]
            return rs[r_idx], r_idx, start - skipped_rs_offset, end - skipped_rs_offset
    
        def split_off_prefix(r, start, end):
            """Return adjusted `end` after splitting prefix off into separate run.
    
            Does nothing if `r` is already the start of the isolated run.
            """
            if start > 0:
                prefix_r = copy.deepcopy(r)
                r.addprevious(prefix_r)
                r.text = r.text[start:]
                prefix_r.text = prefix_r.text[:start]
            return end - start
    
        def split_off_suffix(r, end):
            """Split `r` at `end` such that suffix is in separate following run."""
            suffix_r = copy.deepcopy(r)
            r.addnext(suffix_r)
            r.text = r.text[:end]
            suffix_r.text = suffix_r.text[end:]
    
        def lengthen_run(r, r_idx, end):
            """Add prefixes of following runs to `r` until `end` is reached."""
            while len(r.text) < end:
                suffix_len_reqd = end - len(r.text)
                r_idx += 1
                next_r = rs[r_idx]
                if len(next_r.text) <= suffix_len_reqd:
                    # --- subsume next run ---
                    r.text = r.text + next_r.text
                    next_r.getparent().remove(next_r)
                    continue
                if len(next_r.text) > suffix_len_reqd:
                    # --- take prefix from next run ---
                    r.text = r.text + next_r.text[:suffix_len_reqd]
                    next_r.text = next_r.text[suffix_len_reqd:]
    
        r, r_idx, start, end = advance_to_run_containing_start(start, end)
        end = split_off_prefix(r, start, end)
    
        # --- if run is longer than isolation-range we need to split-off a suffix run ---
        if len(r.text) > end:
            split_off_suffix(r, end)
        # --- if run is shorter than isolation-range we need to lengthen it by taking text
        # --- from subsequent runs
        elif len(r.text) < end:
            lengthen_run(r, r_idx, end)
    
        return Run(r, paragraph)
    

    It's more complicated to do than one might think; it was definitely more complicated than I thought when I started working on it. In any case, it's something that comes in handy from time to time.
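
As a usage sketch for the original question, the loop below finds matches on `paragraph.text` and highlights each isolated run. `highlight_matches` and `compile_word` are hypothetical helper names introduced here, and `isolate_run` is passed in as an argument so the sketch does not depend on where that function is defined. It assumes that splitting runs leaves `paragraph.text` unchanged, so offsets computed once remain valid for later calls.

```python
import re


def compile_word(word):
    """Build a whole-word pattern; re.escape guards regex metacharacters."""
    return re.compile(rf"\b{re.escape(word)}\b")


def highlight_matches(paragraph, pattern, color, isolate_run):
    """Highlight every match of `pattern` in `paragraph`; return the match count.

    `isolate_run` is the function from the answer above, passed in explicitly.
    `color` is whatever `run.font.highlight_color` accepts, for example a
    WD_COLOR_INDEX member.
    """
    # Assumption: isolate_run() only re-partitions the runs and leaves the
    # paragraph text as it was, so these offsets stay valid throughout.
    matches = list(pattern.finditer(paragraph.text))
    for match in matches:
        run = isolate_run(paragraph, match.start(), match.end())
        run.font.highlight_color = color
    return len(matches)


# Usage with python-docx (shown as comments, not executed here):
#     doc = docx.Document("demo.docx")
#     words = {"car": "RED", "bus": "GREEN", "train station": "BLUE"}
#     for word, color in words.items():
#         pattern = compile_word(word)
#         for par in doc.paragraphs:
#             highlight_matches(
#                 par, pattern, getattr(WD_COLOR_INDEX, color), isolate_run
#             )
#     doc.save("new.docx")
```

One point worth checking in your own documents: the offsets come from `paragraph.text`, while `isolate_run` walks only the runs that are direct children of the paragraph. Depending on the python-docx version, paragraphs containing hyperlinks may need extra care.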