python, xml, ms-word, docx, python-docx

How to extract image from table in MS Word document with docx library?


I am working on a program that needs to extract two images from an MS Word document to use them in another document. I know where the images are located (first table in the document), but when I try to extract any information from the table (even just plain text), I get empty cells.

Here is the Word document that I want to extract the images from. I want to extract the 'Rentel' images from the first page (first table, rows 0 and 1, column 2).


I have tried the following code:

from docxtpl import DocxTemplate

source_document = DocxTemplate("Source document.docx")

# It doesn't really matter which rows or columns I use for the cells, everything is empty
print(source_document.tables[0].cell(0,0).text)

Which just gives me empty lines...


I have read in this discussion and this one that the problem might be that the table is "contained in a wrapper element that Python Docx cannot read". They suggest altering the source document, but I want to be able to select any document that was previously created from the same template as a source document. Those documents all have the same problem, and I cannot change every one of them separately, so a Python-only solution is really the only way I can think of to solve the problem.


Since I only want those two specific images, extracting every image from the XML by unzipping the Word file doesn't really suit my needs, unless I know which image names to look for in the unzipped Word file folders.


I really want this to work as it is part of my thesis (and I'm just an electromechanical engineer, so I don't know that much about software).


[EDIT]: Here is the XML for the first image (source_document.tables[0].cell(0,2)._tc.xml) and here it is for the second image (source_document.tables[0].cell(1,2)._tc.xml). I noticed, however, that using (0,2) as the row and column values gives me all the rows in column 2 of the first "visible" table, and cell (1,2) gives me all the rows in column 2 of the second "visible" table.

If the problem isn't directly solvable with Python Docx, is it possible to search for the image name or ID within the XML and then add the image to the other document with Python Docx using that ID or name?


Solution

  • Well, the first thing that jumps out is that both of the cells (w:tc elements) you posted contain a nested table. This is perhaps unusual, but certainly a valid composition. Maybe they did that so they could include a caption in a cell below the image or something.

    To access the nested table you'd have to do something like:

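    # The outer cell holds a nested table rather than text directly
    outer_cell = source_document.tables[0].cell(0, 2)
    nested_table = outer_cell.tables[0]
    # Cells of the nested table are addressed like those of any other table
    inner_cell_1 = nested_table.cell(0, 0)
    print(inner_cell_1.text)
    # ---etc....---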
    

    I'm not sure that solves your whole problem, but it strikes me that this is two or more questions in the end, the first being: "Why isn't my table cell showing up?" and the second perhaps being "How do I get an image out of a table cell?" (once you've actually found the cell in question).
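
For the second question, here is a rough sketch of one way to get the image bytes once you know which cell you want. It does not use python-docx. It reads the .docx as a zip archive with the standard library and follows each picture's relationship ID (the r:embed attribute on a:blip) to the media file it points at, which also tells you the image name inside the archive. The function name images_in_cell is made up for this example. It assumes the usual package layout (word/document.xml and word/_rels/document.xml.rels) and that the table, its rows and its cells are not wrapped in other elements. It also counts w:tc elements by position, so merged cells would shift the column index.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
R_EMBED = "{%s}embed" % NS["r"]


def images_in_cell(docx_file, table_idx=0, row_idx=0, col_idx=0):
    """Return [(part_name, image_bytes), ...] for the pictures in one cell.

    Hypothetical helper. Pictures in tables nested inside the cell are
    found too, because ".//a:blip" matches descendants at any depth.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
        body = root.find("w:body", NS)
        # Direct children only, so nested tables do not shift the indices.
        # Assumption: no wrapper elements and no merged cells in the way.
        table = body.findall("w:tbl", NS)[table_idx]
        row = table.findall("w:tr", NS)[row_idx]
        cell = row.findall("w:tc", NS)[col_idx]

        # Map relationship IDs to targets such as "media/image1.png".
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
        targets = {
            rel.get("Id"): rel.get("Target")
            for rel in rels.findall("{%s}Relationship" % PKG_REL_NS)
        }

        found = []
        for blip in cell.iterfind(".//a:blip", NS):
            target = targets.get(blip.get(R_EMBED))
            if target is None:
                continue  # linked rather than embedded, or dangling ID
            if target.startswith("/"):
                # Absolute target: already relative to the package root.
                part_name = target.lstrip("/")
            else:
                # Relative target: resolved against the "word" folder.
                part_name = posixpath.normpath(posixpath.join("word", target))
            found.append((part_name, zf.read(part_name)))
        return found
```

With the document from the question, images_in_cell("Source document.docx", 0, 0, 2) would be the call to try for the first image and images_in_cell("Source document.docx", 0, 1, 2) for the second, assuming the layout is as described. The returned bytes can be written to a file or wrapped in io.BytesIO and passed to python-docx's add_picture for the target document.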