Tags: python, xml, parsing, ms-word, elementtree

Multi-level listing bullet point "text" not showing in XML tree of Word Document's XML


I am trying to parse Word documents through their XML, using Python's xml.etree.ElementTree module. This is the code I used to write the XML of a user-selected Word document to a plain .txt file.

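from tkinter import filedialog
import zipfile
import xml.etree.ElementTree as ET


def get_export_raw_xml(doc_filename):
    # A .docx file is a ZIP archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(doc_filename) as docx:
        raw_xml = docx.read('word/document.xml')
    temp_root = ET.fromstring(raw_xml)
    # encoding="unicode" returns a str; str() on bytes would write a b'...' literal.
    with open("TempDocXML.txt", "w", encoding="utf-8") as file_xml:
        file_xml.write(ET.tostring(temp_root, encoding="unicode"))


# Ask the user to pick a Word document, then dump its XML.
file_path = filedialog.askopenfilename()
get_export_raw_xml(file_path)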

The XML in the output .txt file is neither formatted nor clean, so I paste it into a web tool that indents it into a readable XML tree (link: https://jsonformatter.org/xml-formatter).
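As an aside, the indentation step can probably be done locally instead of through a web tool. The following is a minimal sketch, assuming Python 3.9 or later (where `ET.indent` is available). The helper name `pretty_xml` and the constant `W_NS` are hypothetical names introduced here for illustration. Registering the WordprocessingML namespace under the prefix `w` should also make the output use `w:` instead of the auto-generated `ns0:`.

```python
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def pretty_xml(raw_xml):
    """Return an indented string for the given XML (bytes or str).

    Hypothetical helper, not part of the original code.
    """
    # Serialize this namespace with the prefix "w" rather than "ns0".
    ET.register_namespace("w", W_NS)
    root = ET.fromstring(raw_xml)
    # ET.indent modifies the tree in place (requires Python 3.9+).
    ET.indent(root, space="    ")
    return ET.tostring(root, encoding="unicode")
```

The returned string could then be written to the .txt file in place of the unformatted `ET.tostring` output.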

The Word file I am using is a test document I created, which looks like this:

Sample Word Document Used

As can be seen in the image, there are collapsible headings such as 'CHAPTER 1— GENERAL' and 'Section 1.01 First Section'. The words 'CHAPTER 1-' and 'Section 1.01' effectively act as the 'bullet points' of the multi-level list defined in this Word document.

Supposedly, the XML of any Word document should reveal everything, including the text of these multi-level list 'bullet points'. But when I extract it, it looks something like this (this is just a portion of the XML):

<ns0:body>
        <ns0:p ns2:paraId="72DB1B6D" ns2:textId="78A4569C" ns0:rsidR="00714955" ns0:rsidRPr="00A41AB5" ns0:rsidRDefault="00936E3D" ns0:rsidP="00936E3D">
            <ns0:pPr>
                <ns0:jc ns0:val="center" />
                <ns0:rPr>
                    <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
                    <ns0:b />
                    <ns0:bCs />
                    <ns0:sz ns0:val="24" />
                    <ns0:szCs ns0:val="24" />
                </ns0:rPr>
            </ns0:pPr>
            <ns0:r ns0:rsidRPr="00A41AB5">
                <ns0:rPr>
                    <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
                    <ns0:b />
                    <ns0:bCs />
                    <ns0:sz ns0:val="24" />
                    <ns0:szCs ns0:val="24" />
                </ns0:rPr>
                <ns0:t>THIS IS A TEST DOCUMENT</ns0:t>
            </ns0:r>
        </ns0:p>
        <ns0:p ns2:paraId="4EF8AB7A" ns2:textId="1B2FDA0C" ns0:rsidR="00F20298" ns0:rsidRPr="00A41AB5" ns0:rsidRDefault="00477BC4" ns0:rsidP="00477BC4">
            <ns0:pPr>
                <ns0:pStyle ns0:val="Heading1" />
                <ns0:jc ns0:val="center" />
                <ns0:rPr>
                    <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
                </ns0:rPr>
            </ns0:pPr>
            <ns0:r ns0:rsidRPr="00A41AB5">
                <ns0:rPr>
                    <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
                    <ns0:b />
                    <ns0:bCs />
                    <ns0:color ns0:val="auto" />
                    <ns0:sz ns0:val="24" />
                    <ns0:szCs ns0:val="24" />
                </ns0:rPr>
                <ns0:t>GENERAL</ns0:t>
            </ns0:r>
        </ns0:p>
        <ns0:p ns2:paraId="6EA4E0BE" ns2:textId="162341F0" ns0:rsidR="004D5F78" ns0:rsidRPr="00A41AB5" ns0:rsidRDefault="00D254AC" ns0:rsidP="00477BC4">
            <ns0:pPr>
                <ns0:pStyle ns0:val="Heading2" />
                <ns0:rPr>
                    <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
                    <ns0:b />
                    <ns0:bCs />
                    <ns0:color ns0:val="auto" />
                    <ns0:sz ns0:val="24" />
                    <ns0:szCs ns0:val="24" />
                </ns0:rPr>
            </ns0:pPr>
            <ns0:r ns0:rsidRPr="00A41AB5">
                <ns0:rPr>
                    <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
                    <ns0:b />
                    <ns0:bCs />
                    <ns0:color ns0:val="auto" />
                    <ns0:sz ns0:val="24" />
                    <ns0:szCs ns0:val="24" />
                </ns0:rPr>
                <ns0:tab />
            </ns0:r>
            <ns0:r ns0:rsidR="002636FE" ns0:rsidRPr="00A41AB5">
                <ns0:rPr>
                    <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
                    <ns0:b />
                    <ns0:bCs />
                    <ns0:color ns0:val="auto" />
                    <ns0:sz ns0:val="24" />
                    <ns0:szCs ns0:val="24" />
                </ns0:rPr>
                <ns0:t>First Section</ns0:t>
            </ns0:r>
        </ns0:p> 

In the extracted XML, I looked for the text of the multi-level list defined in the Word document. I do not know how or where to look for it, but why is it not there?

            <ns0:r ns0:rsidRPr="00A41AB5">
                <ns0:rPr>
                    <ns0:rFonts ns0:ascii="Arial" ns0:hAnsi="Arial" ns0:cs="Arial" />
                    <ns0:b />
                    <ns0:bCs />
                    <ns0:color ns0:val="auto" />
                    <ns0:sz ns0:val="24" />
                    <ns0:szCs ns0:val="24" />
                </ns0:rPr>
                <ns0:t>GENERAL</ns0:t>
            </ns0:r>

Within this text-run block, I do not see the words 'CHAPTER 1-' or 'Section 1.01' (for example) anywhere in the XML tree. Why is that, and how can I find them?


Solution

  • Someone has set the style "Heading1" to use a custom numbering. But numbering definitions are not stored in /word/document.xml of the *.docx ZIP archive, so you will not see the numbering text there.

    In the paragraph's paragraph properties there is <ns0:pStyle ns0:val="Heading1" />. This links to /word/styles.xml in the *.docx ZIP archive.

    In /word/styles.xml you will find something like

    <w:style w:type="paragraph" w:styleId="Heading1">
    ...
     <w:pPr>
     ...
      <w:numPr>
       <w:numId w:val="1"/>
      </w:numPr>
    ...
    

    The numId (1 is an example) links to /word/numbering.xml in the *.docx ZIP archive.

    In /word/numbering.xml you will find something like

    ...
    <w:num w:numId="1" ...>
     <w:abstractNumId w:val="0"/>
    </w:num>
    ...
    

    The abstractNumId (0 is an example) points to an abstractNum in the same /word/numbering.xml. This will look like so:

    ...
    <w:abstractNum w:abstractNumId="0" ...>
     ...
     <w:lvl w:ilvl="0" ...>
      <w:start w:val="1"/>
      ...
      <w:pStyle w:val="Heading1"/>
      <w:lvlText w:val="CHAPTER %1---"/>
      ...
    

    The same applies to style Heading2 and the numbering text "Section %1.%2".

    Got it?

    Conclusion: To get the numbered headings exactly as Word shows them, one must parse three XML files, /word/document.xml, /word/styles.xml and /word/numbering.xml, and follow the linking IDs between them.
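
The lookup chain described above (style → numId → abstractNumId → lvlText) can be sketched roughly as follows. This is a minimal illustration, not a complete implementation. The function names are hypothetical. It assumes the numbering level links back to the style through `w:pStyle`, as in the excerpt above. It does not follow style inheritance, and it returns only the raw template (for example "CHAPTER %1"), not the rendered number. Counting paragraphs and applying the number format would still be needed to reproduce what Word displays.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(elem):
    """Return the w:val attribute of an element, or None if elem is None."""
    return None if elem is None else elem.get(f"{{{W}}}val")


def style_num_ids(styles_xml):
    """Map each styleId in styles.xml to the numId its w:numPr references."""
    root = ET.fromstring(styles_xml)
    result = {}
    for style in root.findall("w:style", NS):
        num_id = _val(style.find("w:pPr/w:numPr/w:numId", NS))
        if num_id is not None:
            result[style.get(f"{{{W}}}styleId")] = num_id
    return result


def level_text_for_style(style_id, styles_xml, numbering_xml):
    """Return the w:lvlText template linked to a style, or None if not found."""
    # Step 1: styles.xml gives the numId for the style.
    num_id = style_num_ids(styles_xml).get(style_id)
    if num_id is None:
        return None
    numbering = ET.fromstring(numbering_xml)
    # Step 2: the w:num with that numId gives the abstractNumId.
    abstract_id = None
    for num in numbering.findall("w:num", NS):
        if num.get(f"{{{W}}}numId") == num_id:
            abstract_id = _val(num.find("w:abstractNumId", NS))
            break
    if abstract_id is None:
        return None
    # Step 3: in the matching w:abstractNum, pick the level whose w:pStyle
    # names this style (assumption: the level is linked to the style this way).
    for abstract in numbering.findall("w:abstractNum", NS):
        if abstract.get(f"{{{W}}}abstractNumId") != abstract_id:
            continue
        for lvl in abstract.findall("w:lvl", NS):
            if _val(lvl.find("w:pStyle", NS)) == style_id:
                return _val(lvl.find("w:lvlText", NS))
    return None


def level_text_from_docx(docx_path, style_id):
    """Read both parts from a .docx file and resolve the template.

    Raises KeyError if the archive has no word/numbering.xml.
    """
    with zipfile.ZipFile(docx_path) as docx:
        styles_xml = docx.read("word/styles.xml")
        numbering_xml = docx.read("word/numbering.xml")
    return level_text_for_style(style_id, styles_xml, numbering_xml)
```

For a paragraph in /word/document.xml, the style ID to pass in is the value of its `pStyle` element, for example `level_text_from_docx(file_path, "Heading1")`.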