Tags: python, python-3.x, python-docx

python-docx: Find the heading under which a table lies in an MS Word document


I'm struggling to find the name of the heading under which a table lies. I'm using the python-docx library, and I'd like to know how I can get each table along with the name of the heading it sits under.

from docx import Document
from docx.shared import Inches
document = Document('test.docx')

tabs = document.tables

Solution

  • You can extract the document's structure from the .docx file by walking its underlying XML. Try this:

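    from docx import Document
    from docx.text.paragraph import Paragraph

    doc = Document("file.docx")
    # text of every paragraph styled with a built-in 'Heading N' style (an assumption about the file)
    headings = [p.text for p in doc.paragraphs if p.style.name.startswith('Heading')]
    tables = doc.tables  # top-level tables, in document order
    tags = []      # document structure: 'heading', 'text' or 'table'
    all_text = []  # text of each non-empty paragraph, in document order
    schema = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    # walk the direct children of <w:body> in document order
    for child in doc.element.body.iterchildren():
        if child.tag == schema + 'tbl':
            tags.append('table')
        elif child.tag == schema + 'p':
            # read the text through Paragraph, which joins the text of the paragraph's runs
            node_text = Paragraph(child, doc).text
            if node_text:
                if node_text in headings:
                    tags.append('heading')
                else:
                    tags.append('text')
                all_text.append(node_text)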
    

    After running the code above, the tags list shows the structure of the document in order (heading, text, or table), and you can map the corresponding data from the other lists onto it.

    To get the heading of a table, look backwards in the tags list from the table's entry to the nearest preceding 'heading' entry.
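
The mapping step described above could be sketched roughly as follows. This is a minimal illustration, not part of the original answer: the function name heading_for_each_table and the sample lists are hypothetical, and it assumes tags and all_text have the shape produced by the loop above (one tag per body element, one all_text entry per 'heading' or 'text' tag, none for tables).

```python
def heading_for_each_table(tags, all_text):
    """Return, for every 'table' tag, the text of the nearest preceding heading.

    Hypothetical helper: `tags` has one entry per body element, while
    `all_text` has one entry per 'heading' or 'text' tag only.
    """
    result = []
    current_heading = None  # stays None until the first heading is seen
    text_index = 0          # position in all_text of the next paragraph
    for tag in tags:
        if tag == 'table':
            result.append(current_heading)
        else:
            if tag == 'heading':
                current_heading = all_text[text_index]
            text_index += 1  # both 'heading' and 'text' consume one entry
    return result


# Hypothetical sample data, shaped like the output of the loop above.
tags = ['heading', 'text', 'table', 'heading', 'table', 'table']
all_text = ['Introduction', 'Some body text.', 'Results']
table_headings = heading_for_each_table(tags, all_text)
print(table_headings)
```

Since both document.tables and the loop above only cover top-level tables in document order, the i-th entry of the result should correspond to document.tables[i]. A table that appears before any heading gets None.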