python-docx: row.cells returns a "merged" cell multiple times

I'm using the python-docx library and need to read data from tables in a document.

Although I'm able to read the data using the following code,

from docx import Document

document = Document(path_to_your_docx)
tables = document.tables
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)

I get duplicate values wherever a cell's content spans merged cells: the same value is returned once for each grid cell that is merged into it. I cannot simply delete duplicate values, since there might be multiple unmerged cells with the same value. How should I go about fixing this?

For reference, I was directed to ask the question here from this GitHub issue.

Thank you.

Solution

If you want to get each merged cell exactly once, you can add the following code:

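def iter_unique_cells(row):
    """Generate cells in `row`, yielding a horizontally merged cell only once."""
    prior_tc = None
    for cell in row.cells:
        # `_tc` is the cell's underlying XML element (a private attribute).
        # A merged cell exposes the same element at each grid position it spans.
        this_tc = cell._tc
        if this_tc is prior_tc:
            # Same element as the previous grid position: a repeat, so skip it.
            continue
        prior_tc = this_tc
        yield cell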


from docx import Document

document = Document(path_to_your_docx)
for table in document.tables:
    for row in table.rows:
        for cell in iter_unique_cells(row):
            for paragraph in cell.paragraphs:
                print(paragraph.text)

The behavior you see, where the same cell appears once for each "grid" cell it occupies, is the expected behavior. Other code would break if row cells were not uniform across rows, e.g. if each row in a 3 x 3 table did not necessarily contain 3 cells. For example, accessing row.cells[2] in a three-column table would raise an exception if that row contained a merged cell.

At the same time, it could be useful to have an alternate accessor, perhaps Row.iter_unique_cells() that didn't guarantee uniformity across rows. That might be a feature worth requesting.
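Along the same lines, the per-row helper above only compares a cell with its left-hand neighbor, so it does not catch a cell that repeats on a later row. Here is a minimal sketch of a table-wide variant, assuming vertically merged cells also expose the same underlying `_tc` element on each row they span. The name `iter_unique_table_cells` is hypothetical and not part of python-docx, and the demo uses stand-in objects rather than a real document so that it runs on its own.

```python
from types import SimpleNamespace


def iter_unique_table_cells(table):
    """Yield each distinct cell in `table` once, in row-major order."""
    # Map id(tc) -> tc. Keeping a reference to each element ensures its
    # id() cannot be reused by another object while we iterate.
    seen = {}
    for row in table.rows:
        for cell in row.cells:
            tc = cell._tc
            if id(tc) in seen:
                # Already yielded from an earlier grid position.
                continue
            seen[id(tc)] = tc
            yield cell


# Stand-in objects (NOT python-docx classes) that mimic a 2 x 2 table
# whose first column is merged vertically into a single cell "A".
def _fake_cell(tc, text):
    return SimpleNamespace(_tc=tc, text=text)


tc_a, tc_b, tc_c = object(), object(), object()
merged = _fake_cell(tc_a, "A")
fake_table = SimpleNamespace(
    rows=[
        SimpleNamespace(cells=[merged, _fake_cell(tc_b, "B")]),
        SimpleNamespace(cells=[merged, _fake_cell(tc_c, "C")]),
    ]
)

unique_texts = [cell.text for cell in iter_unique_table_cells(fake_table)]
print(unique_texts)
```

With a real document, you would pass each item of `document.tables` in place of `fake_table`. Note that this flattens the table, so use the per-row helper instead if you need to keep track of which row a cell belongs to.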