Search code examples
pythontextdocxdocument

Extracting numbered lists from docx documents as they appear in the document (python 3.X)


I have .docx documents that I need to parse, extracting numbered lists as they appear in the document. For example: enter image description here

However, different approaches have not resulted in anything. What I've tried so far:

  1. Using tools like python-docx. Does not extract lists at all.

  2. Parsing the document XML directly. Fails when the w:numId value changes randomly inside a list, I suspect due to human error when creating the document... No idea how to handle this.

  3. Using pandoc to convert the .docx to a string. Closest approach so far, yet does not keep the identical structure - output looks like this: enter image description here

    extra_args=["--number-sections"] does not change anything. I could parse the string after the conversion with a script, but I'm leaving that for a last-resort solution and instead hoping for a cleaner solution.

Any ideas on how to solve this? Seems like a trivial task, yet it has been driving me mad the last couple of days. Thank you in advance!


Solution

  • aspose-words seems to extract lists correctly:) Thank you @Daviid !

    import aspose.words as aw
    
    doc = aw.Document(FILEPATH)
    
    #convert to .txt
    doc.save(os.path.basename(FILEPATH)+".txt")
    
    #read as string
    docstr = open(os.path.basename(FILEPATH)+".txt", "r").read()