I'm getting errors in documents generated with python-docx, specifically if I include tables from a template


I am using python-docx to programmatically insert data into a new document. When opening the new file, I get the following error message.

Word found unreadable content in document_name. Do you want to recover the contents of this document? If you trust the source of this document, click Yes.

Here is the process that my code is going through to get to this point:

  1. Copy a docx file, which we will call the findings templates, to a working folder
  2. Copy another docx file that is our report document to the same working folder
  3. Locate a table in our findings document that we want to include in the report
  4. Fill in some data in the table, and put the now completed table into the report document.
  5. Save the report document as a new file, called generated.docx
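
Steps 1 and 2 can be sketched with the standard library alone. This is only an illustration: the function name `stage_inputs` and the file names in the usage note are hypothetical, and steps 3 to 5 (which need python-docx) are left out.

```python
import shutil
from pathlib import Path


def stage_inputs(findings_src, report_src, work_dir):
    """Copy both source .docx files into work_dir and return the copies' paths.

    The originals are left untouched, so any edits made later only affect
    the working copies.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the newly created file.
    findings_copy = Path(shutil.copy2(findings_src, work_dir))
    report_copy = Path(shutil.copy2(report_src, work_dir))
    return findings_copy, report_copy
```

A call such as `stage_inputs("findings.docx", "report.docx", "work")` would return the two paths inside `work`, which can then be opened for the remaining steps.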

What I have figured out so far:

  • If I don't fill in any information in the table, and just copy it from the findings templates into the report, I still get the above error message.
  • If I insert other data into the report without the table from the findings templates the document is all good with no errors.
  • The source files have no errors, at least Word doesn't complain when opening either the findings document or the report document.
  • If I let Word correct the errors, all hyperlinks in the document are broken: the link text and the link style are still there, but the target is missing. After pressing Alt+F9, each link shows as { HYPERLINK } with no target, which confirms this.

After quite a bit of googling and finding some similar answers that haven't resolved the issue, I feel like this might be relevant: the tables in the findings document contain a large number of merged cells. Each one is a single table, not nested tables as I initially thought.

The heading is two rows deep, with 4 merged cells on the left for the finding title and, on the right, two columns with headings and the relevant data below. The body of the table is a mixture of merged cells per row: some rows have all cells merged, others have 2 cells merged out of 3.

Here is the code I am using to snag the table from the findings document:

for table in findings_templates.tables:
    row = table.rows[0]
    for cell in row.cells:
        if title.lower() in cell.text.lower():
            severity = get_severity_from_template(table)
            for item in severity_array:
                if severity in item[1]:
                    anchor = item[0]

            # snip
            # Insert some data into table here
            # snip

            addTableAfterParagraph(report_document, table, title)
            return True

Since the errors occur with or without modification, I'll leave out the modification code. Here is the code that inserts the table into the report document:

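def addTableAfterParagraph(report_document, table, title):
    # Find the paragraph whose text matches the title exactly.
    for para in report_document.paragraphs:
        if para.text == title:
            p = para._p
            # addnext() moves the table's XML element out of the source
            # document and places it directly after this paragraph.
            p.addnext(table._tbl)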

Additionally, I added some print lines for table._tbl.xml, and I don't see much of a difference between the source table and the one inserted into the document, except that the first line has a few differing xmlns declarations.
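
One thing a side-by-side comparison of the table XML will not reveal is a reference that points outside the table. A rough way to list such references from the printed XML is sketched below, using only the standard library. The helper name `relationship_refs` and the sample fragment are made up for illustration, and the sketch assumes the XML string declares its namespaces on the root element.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML and by relationship attributes.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def relationship_refs(tbl_xml):
    """Return (element, attribute, value) for every r:* attribute found.

    Attributes such as r:id or r:embed hold relationship ids, which are
    resolved against the containing document part, not the table itself.
    """
    root = ET.fromstring(tbl_xml)
    prefix = "{" + R_NS + "}"
    refs = []
    for el in root.iter():
        for attr, value in el.attrib.items():
            if attr.startswith(prefix):
                refs.append((el.tag.split("}")[-1], attr.split("}")[-1], value))
    return refs


# Hypothetical fragment standing in for the printed table XML.
sample = (
    f'<w:tbl xmlns:w="{W_NS}" xmlns:r="{R_NS}"><w:tr><w:tc><w:p>'
    '<w:hyperlink r:id="rId7"><w:r><w:t>link</w:t></w:r></w:hyperlink>'
    "</w:p></w:tc></w:tr></w:tbl>"
)
print(relationship_refs(sample))  # [('hyperlink', 'id', 'rId7')]
```

Any id reported here only has meaning in the document the table came from, so identical table XML can still behave differently in another document.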

I'd love some troubleshooting tips, or any suggestions. Let me know if any more information is needed. Thanks in advance!

UPDATE: It's the hyperlinks in the source table that are causing the issue. I'm marking this solved for now and may open another more specific question if I can't figure it out.
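
For anyone trying to confirm the same cause, here is a minimal sketch that inspects a saved .docx and reports hyperlink ids that the document's relationship file does not declare as hyperlinks. It rests on assumptions: the main part is at word/document.xml with its relationships at word/_rels/document.xml.rels (the usual layout, but not guaranteed), and the function name `suspect_hyperlink_ids` is hypothetical. It cannot detect an id that happens to collide with a different, valid hyperlink.

```python
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = R_NS + "/hyperlink"


def suspect_hyperlink_ids(docx_path):
    """Return sorted r:id values used by w:hyperlink elements that are either
    missing from the relationship file or declared with a non-hyperlink type."""
    with zipfile.ZipFile(docx_path) as pkg:
        # Assumed part names; see the note above.
        body = ET.fromstring(pkg.read("word/document.xml"))
        rels = ET.fromstring(pkg.read("word/_rels/document.xml.rels"))

    # Map each declared relationship id to its type URI.
    declared = {
        rel.get("Id"): rel.get("Type")
        for rel in rels.iter("{%s}Relationship" % PKG_REL_NS)
    }

    suspects = set()
    for link in body.iter("{%s}hyperlink" % W_NS):
        r_id = link.get("{%s}id" % R_NS)
        # Internal links use w:anchor and carry no r:id, so skip those.
        if r_id is not None and declared.get(r_id) != HYPERLINK_TYPE:
            suspects.add(r_id)
    return sorted(suspects)
```

Running `suspect_hyperlink_ids("generated.docx")` on the generated file and getting a non-empty list would be consistent with hyperlinks whose relationships stayed behind in the source document.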


Solution

  • I ended up reading the data from the source document's tables, then creating my own tables programmatically and inserting that data back in, along with any transforms such as creating hyperlinks and applying styles.

    It was painful, but ultimately solved the issue and provides flexibility in the future.