I want to parse the structure and content of a docx file using python-docx. The file is structured using 'Heading 1' to 'Heading 6'. Under any heading, the content may take the form of a table element.
I understand how to extract the headings and the tables independently of each other using python-docx:
    from docx import Document

    doc = Document("file.docx")
    result = []

    # Walk all paragraphs and emit an indented outline entry per heading.
    for paragraph in doc.paragraphs:
        if paragraph.style == doc.styles['Heading 1']:
            indent = 1
            result.append('- %s' % paragraph.text.strip())
        elif paragraph.style == doc.styles['Heading 2']:
            indent = 2
            result.append(' ' * indent + '- %s:' % paragraph.text.strip())
        elif paragraph.style == doc.styles['Heading 3']:
            indent = 3
            result.append(' ' * indent + '- %s:' % paragraph.text.strip())

    # Tables are listed separately, so their position relative to the headings is lost.
    for table in doc.tables:
        if _is_content(table.row_cells(0)[0].text):
            ...
My problem is preserving the structure. How do I find out which heading a table is under in the source document?
You can extract the structural information from the docx file using the underlying XML. Try this:
    from docx import Document

    doc = Document("file.docx")
    headings = []  # extract only headings from your code
    tables = []    # extract tables from your code
    tags = []
    all_text = []
    schema = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

    # iter() replaces the deprecated getiterator(); iterating an element
    # directly replaces the deprecated getchildren().
    for elem in doc.element.iter():
        if elem.tag == schema + 'body':
            # The children of w:body appear in document order.
            for i, child in enumerate(elem):
                if child.tag != schema + 'tbl':
                    node_text = child.text
                    if node_text:
                        if node_text in headings:
                            ...
After the code above runs, you will have a list of tags that reflects the document's structure (heading, text and table); you can then map the corresponding data from the lists.
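To make the idea concrete, here is a minimal sketch of walking the children of the body element in document order and recording which headings enclose each table. It is not the snippet above completed: the names heading_level, element_text and tables_by_heading are made up for illustration. It also assumes that heading paragraphs carry style IDs of the form 'Heading1' to 'Heading6', which holds for the default English template but may not hold for localized or custom ones. It only relies on element tags, so it should work on the w:body element that python-docx exposes as doc.element.body as well as on the hand-written sample used here.

```python
import xml.etree.ElementTree as ET

NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % NS


def heading_level(p):
    """Return N if paragraph element `p` has a 'HeadingN' style ID, else None."""
    style = p.find(f"{W}pPr/{W}pStyle")
    if style is None:
        return None
    val = style.get(f"{W}val", "")
    # Assumption: style IDs look like 'Heading1'; other templates may differ.
    if val.startswith("Heading") and val[7:].isdigit():
        return int(val[7:])
    return None


def element_text(elem):
    """Concatenate the text of all w:t elements below `elem`."""
    return "".join(t.text or "" for t in elem.iter(f"{W}t"))


def tables_by_heading(body):
    """Return (heading_path, table_element) pairs in document order.

    heading_path is a tuple of the heading texts that enclose the table.
    """
    path = []   # stack of (level, text) for the currently open headings
    found = []
    for child in body:
        if child.tag == f"{W}p":
            level = heading_level(child)
            if level is not None:
                # A new heading closes headings at the same or a deeper level.
                while path and path[-1][0] >= level:
                    path.pop()
                path.append((level, element_text(child).strip()))
        elif child.tag == f"{W}tbl":
            found.append((tuple(text for _, text in path), child))
    return found


# Tiny hand-written body for demonstration only.
SAMPLE = (
    f'<w:body xmlns:w="{NS}">'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Intro</w:t></w:r></w:p>'
    '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Data</w:t></w:r></w:p>'
    '<w:tbl><w:tr><w:tc><w:p><w:r><w:t>cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>'
    '</w:body>'
)
pairs = tables_by_heading(ET.fromstring(SAMPLE))
print([(path, element_text(tbl)) for path, tbl in pairs])
# prints [(('Intro', 'Data'), 'cell')]
```

With a real file you would pass Document("file.docx").element.body instead of the parsed sample. If you need python-docx Table objects rather than raw elements, wrapping each element with docx.table.Table(element, doc) is a commonly used approach.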