I'm using the python-docx library and need to read data from tables in a document.
Although I'm able to read the data using the following code,
from docx import Document

document = Document(path_to_your_docx)
tables = document.tables
for table in tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)
I get duplicate values wherever cells are merged: the content of a merged cell is returned once for each grid cell that was merged into it. I cannot simply delete duplicate values, since there might be multiple unmerged cells with the same value. How should I go about fixing this?
For reference, I was directed to ask the question here from this GitHub issue.
Thank you.
If you want to get each merged cell exactly once, you can add the following code:
from docx import Document


def iter_unique_cells(row):
    """Generate cells in `row` skipping empty grid cells."""
    prior_tc = None
    for cell in row.cells:
        # A merged cell shows up once per grid cell it spans; each
        # occurrence shares the same underlying `_tc` element.
        this_tc = cell._tc
        if this_tc is prior_tc:
            continue
        prior_tc = this_tc
        yield cell


document = Document(path_to_your_docx)
for table in document.tables:
    for row in table.rows:
        for cell in iter_unique_cells(row):
            for paragraph in cell.paragraphs:
                print(paragraph.text)
The behavior you see, where the same cell appears once for each "grid" cell it occupies, is the expected behavior. It would cause problems elsewhere if row cells were not uniform across rows, e.g. if each row in a 3 x 3 table did not necessarily contain 3 cells. For example, accessing row.cells[2] in a three-column table would raise an exception if a merged cell were present in that row.
At the same time, it could be useful to have an alternate accessor, perhaps Row.iter_unique_cells(), that didn't guarantee uniformity across rows. That might be a feature worth requesting.
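If your tables also contain vertically merged cells, the per-row helper above will still report the merged cell once in every row it spans. A table-wide variant is sketched below. It is only a sketch: `iter_unique_table_cells` is a hypothetical name, not part of python-docx, and it assumes that a vertically merged cell exposes the same underlying `_tc` element in each row it spans, which is worth checking against your installed version. The demo uses stand-in objects instead of a real document so it runs without a .docx file.

```python
from types import SimpleNamespace


def iter_unique_table_cells(table):
    """Yield each distinct cell in `table` once, in row-major order.

    Hypothetical helper (not a python-docx API). Cells are treated as
    duplicates when they share the same underlying `_tc` object.
    """
    seen = {}  # id(tc) -> tc; holding the reference keeps each id stable
    for row in table.rows:
        for cell in row.cells:
            tc = cell._tc  # private attribute, as in the helper above
            if id(tc) in seen:
                continue
            seen[id(tc)] = tc
            yield cell


def fake_cell(tc, text):
    """Stand-in for a table cell, for demonstration only."""
    return SimpleNamespace(_tc=tc, text=text)


# A 2 x 3 grid: one cell merged across the first two columns of both
# rows, plus two separate unmerged cells that happen to hold "x".
tc_merged, tc_b, tc_c = object(), object(), object()
merged = fake_cell(tc_merged, "merged")
fake_table = SimpleNamespace(
    rows=[
        SimpleNamespace(cells=[merged, merged, fake_cell(tc_b, "x")]),
        SimpleNamespace(cells=[merged, merged, fake_cell(tc_c, "x")]),
    ]
)

texts = [cell.text for cell in iter_unique_table_cells(fake_table)]
print(texts)  # ['merged', 'x', 'x']
```

Because cells are compared by identity rather than by text, the two unmerged cells containing "x" are both kept, while the merged cell is reported only once. With a real document you would pass one of `document.tables` instead of `fake_table`.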