
Getting the List Numbers of List Items in docx file using Python-Docx


When I access a paragraph's text, it does not include the list numbering.

Current code:

from docx import Document

document = Document("C:/Foo.docx")
for p in document.paragraphs:
    print(p.text)

List in docx file:

[Image: numbered list]

I am expecting:
(1) The naturalization of both ...
(2) The naturalization of the ...
(3) The naturalization of the ...

What I get:
The naturalization of both ...
The naturalization of the ...
The naturalization of the ...

Upon checking the XML of the document, I found that the list numbers are stored in w:abstractNum, but I have no idea how to access them or connect them to the proper list item. How can I access the number of each list item in python-docx so that it can be included in my output? Is there also a way to determine the proper nesting of these lists using python-docx?


Solution

  • According to [ReadThedocs.Python-DocX]: Style-related objects - _NumberingStyle objects, this functionality is not implemented yet.
    The alternative (at least one of them), [PyPI]: docx2python, handles these elements rather poorly (mainly because it returns everything converted to strings).

    So, a solution would be to parse the XML files manually (I worked out how empirically, on this very example). A good documentation source is Office Open XML (I don't know whether it's a standard followed by all the tools that deal with .docx files, MS Word in particular):

    • Get each paragraph (w:p node) from word/document.xml
      • Check whether it's a numbered item (it has a w:pPr -> w:numPr subnode)

      • Get the numbering Id and level: the w:val attributes of the w:numId and w:ilvl subnodes (of the node from the previous bullet)

      • Match the 2 values with (in word/numbering.xml):

        • w:numId attribute of a w:num node, whose w:abstractNumId subnode (w:val attribute) points to the w:abstractNumId attribute of a w:abstractNum node
        • w:ilvl attribute of a w:lvl subnode (of that w:abstractNum node)

        and get the w:val attribute of the corresponding w:numFmt and w:lvlText subnodes (note that bullets are included as well; they can be told apart by the bullet value of the aforementioned w:numFmt's attribute)
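The manual lookup described in the steps above could be sketched roughly as follows, using only the standard library. This is a minimal sketch, not a complete implementation: the helper names (w_attr, load_numbering, list_paragraphs) are made up for illustration, the numId is resolved through the w:num node to its w:abstractNum definition, and w:lvlOverride entries as well as numbering inherited from paragraph styles are assumed absent and ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w_attr(node, name):
    """Return the value of a w:-namespaced attribute (hypothetical helper)."""
    return node.get("{%s}%s" % (W, name))


def load_numbering(numbering_xml):
    """Map (numId, ilvl) -> (numFmt, lvlText) from word/numbering.xml content."""
    root = ET.fromstring(numbering_xml)
    abstract = {}  # abstractNumId -> {ilvl: (numFmt, lvlText)}
    for an in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in an.findall("w:lvl", NS):
            fmt = lvl.find("w:numFmt", NS)
            txt = lvl.find("w:lvlText", NS)
            levels[w_attr(lvl, "ilvl")] = (
                w_attr(fmt, "val") if fmt is not None else None,
                w_attr(txt, "val") if txt is not None else None,
            )
        abstract[w_attr(an, "abstractNumId")] = levels
    result = {}
    # Each w:num node links a numId to one abstract definition.
    for num in root.findall("w:num", NS):
        ref = num.find("w:abstractNumId", NS)
        if ref is None:
            continue
        for ilvl, info in abstract.get(w_attr(ref, "val"), {}).items():
            result[(w_attr(num, "numId"), ilvl)] = info
    return result


def list_paragraphs(docx_file):
    """Yield (text, level, numFmt, lvlText); the last three are None for plain paragraphs."""
    with zipfile.ZipFile(docx_file) as zf:
        if "word/numbering.xml" in zf.namelist():
            numbering = load_numbering(zf.read("word/numbering.xml"))
        else:
            numbering = {}
        doc = ET.fromstring(zf.read("word/document.xml"))
    for p in doc.iter("{%s}p" % W):
        text = "".join(t.text or "" for t in p.iter("{%s}t" % W))
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            yield text, None, None, None
            continue
        num_id = num_pr.find("w:numId", NS)
        ilvl = num_pr.find("w:ilvl", NS)
        # A missing w:ilvl is treated as level 0 here (an assumption).
        level = w_attr(ilvl, "val") if ilvl is not None else "0"
        key = (w_attr(num_id, "val") if num_id is not None else None, level)
        fmt, lvl_text = numbering.get(key, (None, None))
        yield text, int(level), fmt, lvl_text
```

Calling list_paragraphs("sample.docx") would then give, for every paragraph, its nesting level together with the raw format name and label template, which still have to be turned into an actual label such as 1) or a).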

    However, that seems extremely complex, so I'm proposing a workaround (a cheap trick) that makes use of docx2python's partial support.

    Test document (sample.docx - created with LibreOffice):

    [Image: screenshot of sample.docx]

    code00.py:

    #!/usr/bin/env python
    
    import sys
    import docx
    from docx2python import docx2python as dx2py
    
    
    def ns_tag_name(node, name):
        if node.nsmap and node.prefix:
            return "{{{:s}}}{:s}".format(node.nsmap[node.prefix], name)
        return name
    
    
    def descendants(node, desc_strs):
        if node is None:
            return []
        if not desc_strs:
            return [node]
        ret = {}
        for child_str in desc_strs[0]:
            for child in node.iterchildren(ns_tag_name(node, child_str)):
                descs = descendants(child, desc_strs[1:])
                if not descs:
                    continue
                cd = ret.setdefault(child_str, [])
                if isinstance(descs, list):
                    cd.extend(descs)
                else:
                    cd.append(descs)
        return ret
    
    
    def simplified_descendants(desc_dict):
        ret = []
        for vs in desc_dict.values():
            for v in vs:
                if isinstance(v, dict):
                    ret.extend(simplified_descendants(v))
                else:
                    ret.append(v)
        return ret
    
    
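    def process_list_data(attrs, dx2py_elem):
        # attrs holds the w:ilvl node found for this paragraph; its w:val is the nesting level
        desc = simplified_descendants(attrs)[0]
        level = int(desc.attrib[ns_tag_name(desc, "val")])
        # Split docx2python's text for this paragraph on tabs; the first non-empty piece is taken as the list label
        elem = [i for i in dx2py_elem[0].split("\t") if i][0]
        return "    " * level + elem + " "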
    
    
    def main(*argv):
        fname = r"./sample.docx"
        docd = docx.Document(fname)
        docdpy = dx2py(fname)
        dr = docdpy.docx_reader
        #print(dr.files)  # !!! Check word/numbering.xml !!!
        docdpy_runs = docdpy.document_runs[0][0][0]
        if len(docd.paragraphs) != len(docdpy_runs):
            print("Lengths don't match. Abort")
            return -1
        subnode_tags = (("pPr",), ("numPr",), ("ilvl",))  # (("pPr",), ("numPr",), ("ilvl", "numId"))  # numId is for matching elements from word/numbering.xml
        for idx, (par, l) in enumerate(zip(docd.paragraphs, docdpy_runs)):
            #print(par.text, l)
            numbered_attrs = descendants(par._element, subnode_tags)
            #print(numbered_attrs)
            if numbered_attrs:
                print(process_list_data(numbered_attrs, l) + par.text)
            else:
                print(par.text)
    
    
    if __name__ == "__main__":
        print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                       64 if sys.maxsize > 0x100000000 else 32, sys.platform))
        rc = main(*sys.argv[1:])
        print("\nDone.")
        sys.exit(rc)
    

    Output:

    [cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q066374154]> "e:\Work\Dev\VEnvs\py_pc064_03.09_test0\Scripts\python.exe" code00.py
    Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)] 064bit on win32
    
    Doc title
    doc subtitle
    
    heading1 text0
    
    Paragr0 line0
    Paragr0 line1
    Paragr0 line2
    
    space Paragr0 line3
    a) aa (numbered)
    heading1 text1
    Paragrx line0
    Paragrx line1
            a)      w tabs Paragrx line2 (NOT numbered – just to mimic 1ax below)
    
    1) paragrx 1x (numbered)
        a) paragrx 1ax (numbered)
            I) paragrx 1aIx (numbered)
        b) paragrx 1bx (numbered)
    2) paragrx 2x (numbered)
    3) paragrx 3x (numbered)
    
    -- paragrx bullet 0
        -- paragrx bullet 00
    
    paragxx text
    
    Done.
    

    Notes:

    • Only nodes from word/document.xml are processed (via the paragraph's _element (LXML node) attribute)
    • Some list attributes are not captured (due to docx2python's limitations)
    • This is far from robust
    • descendants and simplified_descendants could be simplified considerably, but I wanted to keep the former as generic as possible (in case the functionality needs to be extended)
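
For anyone who does go down the manual XML route instead, turning the w:numFmt and w:lvlText values into visible labels could look roughly like the sketch below. It is only an illustration under stated assumptions: all function names are hypothetical, every counter is assumed to start at 1 (w:start and overrides are ignored), only the decimal, letter and roman formats are handled (anything else falls back to decimal), and letters are assumed to continue after z as aa, bb, and so on.

```python
import re

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
         (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(n):
    """Convert a positive integer to an upper-case roman numeral."""
    out = []
    for value, symbol in ROMAN:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n):
    """1 -> a, 26 -> z, 27 -> aa, 28 -> bb (assumed repetition scheme)."""
    return chr(ord("a") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def format_counter(n, num_fmt):
    """Render one counter according to a w:numFmt value."""
    if num_fmt == "lowerLetter":
        return to_letters(n)
    if num_fmt == "upperLetter":
        return to_letters(n).upper()
    if num_fmt == "lowerRoman":
        return to_roman(n).lower()
    if num_fmt == "upperRoman":
        return to_roman(n)
    return str(n)  # decimal, plus a fallback for formats not handled here


def render_label(lvl_text, num_fmts, counters):
    """Fill the %1, %2, ... placeholders of a w:lvlText template.

    num_fmts[k] and counters[k] describe level k (0-based). Bullet templates
    contain no placeholders, so they come back unchanged.
    """
    def repl(match):
        idx = int(match.group(1)) - 1
        return format_counter(counters[idx], num_fmts[idx])
    return re.sub(r"%([1-9])", repl, lvl_text)


def count_items(items):
    """Given (numId, level) pairs in document order, yield the counters per item."""
    state = {}
    for num_id, level in items:
        counters = state.setdefault(num_id, [])
        del counters[level + 1:]          # deeper levels restart
        while len(counters) < level:      # skipped levels are assumed to start at 1
            counters.append(1)
        if len(counters) == level:
            counters.append(1)
        else:
            counters[level] += 1
        yield list(counters)
```

Feeding the (numId, level) pairs of the list paragraphs through count_items and then through render_label, with the formats and templates read from word/numbering.xml, should give labels of the kind shown in the output above, at least for simple lists.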