python-3.x, python-docx

python-docx - find and replace fails on table with multiple merged cells (list index out of range)


I'm trying to do a find and replace using python-docx.

I'm using this code developed by adejones: How to search and replace a word/text in word document using python-docx

I've used this code successfully on smaller documents, but the current document is quite large (with many tables), and I'm trying to work out what's causing the "list index out of range" error. After a lot of debugging, I realized it happens when the code reaches a table that has many merged cells, something like this:

[Image: example table]

The error happens at "for cell in row.cells". Through the debugger, I have identified that t.rows does have an object and row has an object, so everything prior to getting one of the cells works.

Unfortunately I can't share the document, but I'm curious whether anyone has any insight into getting past this hurdle.

Error:

 File "C:\Anaconda3\lib\site-packages\docx\table.py", line 161, in _cells
    cells.append(cells[-col_count])

IndexError: list index out of range
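
For what it's worth, the failing line hints at how this can go wrong. Below is a simplified, standard-library-only model of that step, not the library's actual source: the function name `flatten_cells` and the tuple layout are made up for illustration. The assumption is that the table is flattened into one list, and that a cell continuing a vertical merge is resolved by looking back `col_count` entries to the cell directly above it. If the earlier rows contributed fewer than `col_count` entries (for example because a row has fewer cells than the table grid), that look-back runs off the start of the list.

```python
def flatten_cells(rows, col_count):
    """Flatten rows into one list, resolving merges by looking backwards.

    Each cell is a (label, grid_span, continues_vertical_merge) tuple.
    This is an illustrative model only, not python-docx code.
    """
    cells = []
    for row in rows:
        for label, grid_span, continues in row:
            for span_idx in range(grid_span):
                if continues:
                    # Reuse the cell one full row earlier in the flat list.
                    cells.append(cells[-col_count])
                elif span_idx > 0:
                    # Horizontal merge: repeat the cell to the left.
                    cells.append(cells[-1])
                else:
                    cells.append(label)
    return cells


# Every row fills the 3-column grid, so the look-back finds "A".
aligned = [
    [("A", 1, False), ("B", 1, False), ("C", 1, False)],
    [("", 1, True), ("D", 1, False), ("E", 1, False)],
]
print(flatten_cells(aligned, 3))  # ['A', 'B', 'C', 'A', 'D', 'E']

# The first row only has 2 cells, so looking back 3 entries fails.
short_row = [
    [("A", 1, False), ("B", 1, False)],
    [("", 1, True), ("D", 1, False)],
]
try:
    flatten_cells(short_row, 3)
except IndexError as exc:
    print("IndexError:", exc)
```

If this model is close to what happens, the trigger is less the merge itself than a vertical merge combined with rows whose cells do not line up with the table grid.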

Code:

def docx_find_replace_text(doc, search_text, replace_text):
    paragraphs = list(doc.paragraphs)
    for t in doc.tables:
        for row in t.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    paragraphs.append(paragraph)
    for p in paragraphs:
        if search_text in p.text:
            inline = p.runs
            # Replace strings and retain the same style.
            # The text to be replaced can be split over several runs so
            # search through, identify which runs need to have text replaced
            # then replace the text in those identified
            started = False
            search_index = 0
            # found_runs is a list of (inline index, index of match, length of match)
            found_runs = list()
            found_all = False
            replace_done = False
            for i in range(len(inline)):

                # case 1: found in single run so short circuit the replace
                if search_text in inline[i].text and not started:
                    found_runs.append((i, inline[i].text.find(search_text), len(search_text)))
                    text = inline[i].text.replace(search_text, str(replace_text))
                    inline[i].text = text
                    replace_done = True
                    found_all = True
                    break

                if search_text[search_index] not in inline[i].text and not started:
                    # keep looking ...
                    continue

                # case 2: search for partial text, find first run
                if search_text[search_index] in inline[i].text and inline[i].text[-1] in search_text and not started:
                    # check sequence
                    start_index = inline[i].text.find(search_text[search_index])
                    check_length = len(inline[i].text)
                    for text_index in range(start_index, check_length):
                        if inline[i].text[text_index] != search_text[search_index]:
                            # no match so must be false positive
                            break
                    if search_index == 0:
                        started = True
                    chars_found = check_length - start_index
                    search_index += chars_found
                    found_runs.append((i, start_index, chars_found))
                    if search_index != len(search_text):
                        continue
                    else:
                        # found all chars in search_text
                        found_all = True
                        break

                # case 3: search for partial text, find subsequent run
                if search_text[search_index] in inline[i].text and started and not found_all:
                    # check sequence
                    chars_found = 0
                    check_length = len(inline[i].text)
                    for text_index in range(0, check_length):
                        if inline[i].text[text_index] == search_text[search_index]:
                            search_index += 1
                            chars_found += 1
                        else:
                            break
                    # no match so must be end
                    found_runs.append((i, 0, chars_found))
                    if search_index == len(search_text):
                        found_all = True
                        break

            if found_all and not replace_done:
                for i, item in enumerate(found_runs):
                    index, start, length = item
                    if i == 0:
                        text = inline[index].text.replace(inline[index].text[start:start + length], str(replace_text))
                        inline[index].text = text
                    else:
                        text = inline[index].text.replace(inline[index].text[start:start + length], '')
                        inline[index].text = text

Solution

  • I seem to have found the answer.

    When iterating over a table's rows, python-docx does not cope well with cells merged across rows (vertical merges). Cells merged across columns are fine, because the merged cell still belongs to a single row. However, when a merge spans more than two rows, python-docx cannot work out how to map the remaining cells in those rows. See the image below. I'm not sure why a single merge is fine while multiple merged rows are not. Once I removed the merged rows from my tables, the code worked.

    [Image: row confusion]