I am trying to parse Word documents through their underlying XML, using Python's xml.etree.ElementTree module. This is the code I used to write the XML of a user-selected Word document to a plain .txt file:
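from tkinter import filedialog
import zipfile
import xml.etree.ElementTree as ET

def get_export_raw_xml(doc_filename):
    # A .docx file is a ZIP archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(doc_filename) as docx:
        raw_xml = docx.read('word/document.xml')
    temp_root = ET.fromstring(raw_xml)
    # encoding="unicode" makes tostring() return str instead of bytes,
    # so the file does not end up containing a b'...' literal.
    with open("TempDocXML.txt", "w", encoding="utf-8") as file_xml:
        file_xml.write(ET.tostring(temp_root, encoding="unicode"))

file_path = filedialog.askopenfilename()
get_export_raw_xml(file_path)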
The XML in the output .txt file is not indented or otherwise formatted, so I paste it into a web tool that pretty-prints it with indentation (link: https://jsonformatter.org/xml-formatter).
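The formatting step can probably be done in Python as well, without the web tool. Below is a minimal sketch using the standard library's xml.dom.minidom; the function names and the default output file name are hypothetical. As far as I know, minidom also keeps the original w: prefixes instead of rewriting them to ns0:, which makes the output easier to compare with the file names and tags in the specification.

```python
import xml.dom.minidom
import zipfile


def pretty_xml(raw_xml):
    """Return an indented copy of an XML document given as str or bytes."""
    return xml.dom.minidom.parseString(raw_xml).toprettyxml(indent="  ")


def export_pretty_document_xml(docx_path, out_path="TempDocXML.txt"):
    # out_path is a hypothetical default, chosen to match the question's code.
    with zipfile.ZipFile(docx_path) as docx:
        raw_xml = docx.read("word/document.xml")
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(pretty_xml(raw_xml))
```

Calling export_pretty_document_xml(file_path) would then replace both the export function above and the manual formatting step.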
The Word file I am using is a test document I created, which looks like this:
As the image shows, there are collapsible headings such as 'CHAPTER 1— GENERAL' and 'Section 1.01 First Section'. The prefixes 'CHAPTER 1-' and 'Section 1.01' effectively act as the 'bullet points' of a multi-level list defined in this Word document.
In principle, the XML of a Word document should contain everything, including the text of these multi-level list labels. But when I extract it, it looks like this (only a portion of the XML is shown):
<ns0:body>
  <ns0:p ns2:paraId="72DB1B6D" ns2:textId="78A4569C" ns0:rsidR="00714955" ns0:rsidRPr="00A41AB5" ns0:rsidRDefault="00936E3D" ns0:rsidP="00936E3D">
    <ns0:pPr>
      <ns0:jc ns0:val="center" />
      <ns0:rPr>
        <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
        <ns0:b />
        <ns0:bCs />
        <ns0:sz ns0:val="24" />
        <ns0:szCs ns0:val="24" />
      </ns0:rPr>
    </ns0:pPr>
    <ns0:r ns0:rsidRPr="00A41AB5">
      <ns0:rPr>
        <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
        <ns0:b />
        <ns0:bCs />
        <ns0:sz ns0:val="24" />
        <ns0:szCs ns0:val="24" />
      </ns0:rPr>
      <ns0:t>THIS IS A TEST DOCUMENT</ns0:t>
    </ns0:r>
  </ns0:p>
  <ns0:p ns2:paraId="4EF8AB7A" ns2:textId="1B2FDA0C" ns0:rsidR="00F20298" ns0:rsidRPr="00A41AB5" ns0:rsidRDefault="00477BC4" ns0:rsidP="00477BC4">
    <ns0:pPr>
      <ns0:pStyle ns0:val="Heading1" />
      <ns0:jc ns0:val="center" />
      <ns0:rPr>
        <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
      </ns0:rPr>
    </ns0:pPr>
    <ns0:r ns0:rsidRPr="00A41AB5">
      <ns0:rPr>
        <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
        <ns0:b />
        <ns0:bCs />
        <ns0:color ns0:val="auto" />
        <ns0:sz ns0:val="24" />
        <ns0:szCs ns0:val="24" />
      </ns0:rPr>
      <ns0:t>GENERAL</ns0:t>
    </ns0:r>
  </ns0:p>
  <ns0:p ns2:paraId="6EA4E0BE" ns2:textId="162341F0" ns0:rsidR="004D5F78" ns0:rsidRPr="00A41AB5" ns0:rsidRDefault="00D254AC" ns0:rsidP="00477BC4">
    <ns0:pPr>
      <ns0:pStyle ns0:val="Heading2" />
      <ns0:rPr>
        <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
        <ns0:b />
        <ns0:bCs />
        <ns0:color ns0:val="auto" />
        <ns0:sz ns0:val="24" />
        <ns0:szCs ns0:val="24" />
      </ns0:rPr>
    </ns0:pPr>
    <ns0:r ns0:rsidRPr="00A41AB5">
      <ns0:rPr>
        <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
        <ns0:b />
        <ns0:bCs />
        <ns0:color ns0:val="auto" />
        <ns0:sz ns0:val="24" />
        <ns0:szCs ns0:val="24" />
      </ns0:rPr>
      <ns0:tab />
    </ns0:r>
    <ns0:r ns0:rsidR="002636FE" ns0:rsidRPr="00A41AB5">
      <ns0:rPr>
        <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
        <ns0:b />
        <ns0:bCs />
        <ns0:color ns0:val="auto" />
        <ns0:sz ns0:val="24" />
        <ns0:szCs ns0:val="24" />
      </ns0:rPr>
      <ns0:t>First Section</ns0:t>
    </ns0:r>
  </ns0:p>
I searched the extracted XML for the label text of the multi-level list defined in the Word document, but could not find it. I do not know how to look for it or where I should be looking. Why is it not there?
<ns0:r ns0:rsidRPr="00A41AB5">
  <ns0:rPr>
    <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
    <ns0:b />
    <ns0:bCs />
    <ns0:color ns0:val="auto" />
    <ns0:sz ns0:val="24" />
    <ns0:szCs ns0:val="24" />
  </ns0:rPr>
  <ns0:t>GENERAL</ns0:t>
</ns0:r>
Within this text run, I do not see the words 'CHAPTER 1-' or 'Section 1.01' (for example) anywhere in the XML tree. Why is that, and how can I find them?
Someone has set the style "Heading1" to use a custom numbering. But numbering definitions are not stored in /word/document.xml of the *.docx ZIP archive, so you will not see them there.
In the paragraph's paragraph properties there is <ns0:pStyle ns0:val="Heading1" />. This links to /word/styles.xml in the *.docx ZIP archive.
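To see which style each paragraph links to, the pStyle value can be read next to the paragraph text. This is only a sketch under the assumption that the XML is passed in as a string or bytes object already read from the archive; paragraph_styles is a hypothetical helper name, and the namespace URI is the standard WordprocessingML one that ElementTree shows as ns0.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace (the "ns0"/"w" prefix in the XML above).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_styles(document_xml):
    """Yield (style_id, text) for each paragraph in word/document.xml.

    style_id is None when the paragraph has no explicit w:pStyle.
    """
    root = ET.fromstring(document_xml)
    for p in root.iter(f"{{{W}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # Concatenate all text runs of the paragraph.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        yield style_id, text
```

For the document in the question this should yield pairs such as ("Heading1", "GENERAL") and ("Heading2", "First Section"), which confirms that the label text itself is not part of the paragraph.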
In /word/styles.xml you will find something like
<w:style w:type="paragraph" w:styleId="Heading1">
  ...
  <w:pPr>
    ...
    <w:numPr>
      <w:numId w:val="1"/>
    </w:numPr>
    ...
The numId (1 is an example) links to /word/numbering.xml in the *.docx ZIP archive.
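The style-to-numId lookup could be sketched like this. It assumes the contents of /word/styles.xml have already been read from the archive, and style_numbering is a hypothetical helper name. It also simplifies: style inheritance via basedOn and numbering applied directly on a paragraph are ignored, and a missing ilvl is treated as level 0.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"


def style_numbering(styles_xml):
    """Map styleId -> (numId, ilvl) for styles that carry a w:numPr."""
    root = ET.fromstring(styles_xml)
    mapping = {}
    for style in root.iter(f"{{{W}}}style"):
        num_id = style.find("w:pPr/w:numPr/w:numId", NS)
        if num_id is None:
            continue  # this style is not linked to a numbering
        ilvl = style.find("w:pPr/w:numPr/w:ilvl", NS)
        # Assumption: no w:ilvl means the first list level (0).
        level = int(ilvl.get(VAL)) if ilvl is not None else 0
        mapping[style.get(f"{{{W}}}styleId")] = (num_id.get(VAL), level)
    return mapping
```

The returned numId is kept as a string because it is only used as a key for the lookup in /word/numbering.xml.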
In /word/numbering.xml you will find something like
...
<w:num w:numId="1" ...>
  <w:abstractNumId w:val="0"/>
</w:num>
...
The abstractNumId (0 is an example) points to an abstractNum in the same /word/numbering.xml. This will look like so:
...
<w:abstractNum w:abstractNumId="0" ...>
  ...
  <w:lvl w:ilvl="0" ...>
    <w:start w:val="1"/>
    ...
    <w:pStyle w:val="Heading1"/>
    <w:lvlText w:val="CHAPTER %1---"/>
    ...
The same applies to style Heading2 and the numbering text "Section %1.%2".
Got it?
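The two hops inside /word/numbering.xml (numId to abstractNumId, then abstractNum to the lvlText of each level) might be resolved as follows. This is a sketch, not a complete implementation: level_texts is a hypothetical name, the XML is assumed to be already read from the archive, and w:lvlOverride elements are ignored.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"


def level_texts(numbering_xml, num_id):
    """Return {ilvl: lvlText} for the list instance with the given numId."""
    root = ET.fromstring(numbering_xml)

    # Hop 1: w:num (by numId) -> abstractNumId
    abstract_id = None
    for num in root.findall("w:num", NS):
        if num.get(f"{{{W}}}numId") == str(num_id):
            ref = num.find("w:abstractNumId", NS)
            abstract_id = ref.get(VAL) if ref is not None else None
            break
    if abstract_id is None:
        return {}

    # Hop 2: w:abstractNum (by abstractNumId) -> lvlText of every level
    for abstract in root.findall("w:abstractNum", NS):
        if abstract.get(f"{{{W}}}abstractNumId") != abstract_id:
            continue
        texts = {}
        for lvl in abstract.findall("w:lvl", NS):
            lvl_text = lvl.find("w:lvlText", NS)
            if lvl_text is not None:
                texts[int(lvl.get(f"{{{W}}}ilvl"))] = lvl_text.get(VAL)
        return texts
    return {}
```

For the example above, level_texts(numbering_xml, "1") would be expected to contain "CHAPTER %1---" for level 0 and "Section %1.%2" for level 1.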
Conclusion: To get the numbered headings exactly as Word shows them, one must parse three XML files, /word/document.xml, /word/styles.xml and /word/numbering.xml, and follow the linking IDs between them.
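Note that even with the three files linked, the label is not stored anywhere as finished text: the %1 and %2 placeholders still have to be filled in by counting the numbered paragraphs in document order. A rough sketch of that counting step follows. The function names are hypothetical, every level is assumed to start at 1, and only the number formats "decimal" and "decimalZero" are handled (the latter would be needed to get 'Section 1.01' rather than 'Section 1.1').

```python
import re


def format_number(n, num_fmt):
    # Only two formats are handled here; anything else falls back to decimal.
    if num_fmt == "decimalZero":
        return f"{n:02d}"
    return str(n)


def render_labels(levels, level_texts, num_formats=None):
    """Render list labels for numbered paragraphs in document order.

    levels:      ilvl of each numbered paragraph, in document order
    level_texts: {ilvl: lvlText}, e.g. {0: "CHAPTER %1---"}
    num_formats: optional {ilvl: numFmt}; missing levels use "decimal"
    """
    num_formats = num_formats or {}
    counters = {}
    labels = []
    for ilvl in levels:
        counters[ilvl] = counters.get(ilvl, 0) + 1
        # A new item on this level restarts all deeper levels.
        for deeper in [k for k in counters if k > ilvl]:
            del counters[deeper]

        def substitute(match):
            ref = int(match.group(1)) - 1  # %1 refers to ilvl 0, and so on
            return format_number(counters.get(ref, 1),
                                 num_formats.get(ref, "decimal"))

        labels.append(re.sub(r"%([1-9])", substitute, level_texts[ilvl]))
    return labels


labels = render_labels(
    [0, 1, 1, 0, 1],
    {0: "CHAPTER %1---", 1: "Section %1.%2"},
    {1: "decimalZero"},
)
print(labels)
```

Combining each label with the paragraph text from /word/document.xml should then give headings close to what Word displays, at least for simple lists like the one in the question.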