I have .docx documents that I need to parse, extracting numbered lists as they appear in the document. For example:
However, none of the approaches I have tried has worked. What I've tried so far:
Using tools like python-docx. Does not extract lists at all.
Parsing the document XML directly. Fails when the w:numId value changes randomly inside a list, which I suspect is due to human error when the document was created. No idea how to handle this.
Using pandoc to convert the .docx to a string. Closest approach so far, but it does not preserve the original structure. The output looks like this:
extra_args=["--number-sections"] does not change anything. I could parse the string after the conversion with a script, but I'm keeping that as a last resort and hoping for a cleaner solution.
Any ideas on how to solve this? Seems like a trivial task, yet it has been driving me mad the last couple of days. Thank you in advance!
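For the raw-XML route, here is a minimal sketch of one way to cope with a w:numId that changes mid-list: read w:numPr from each paragraph, but number the items by indent level (w:ilvl) only, so a stray w:numId does not restart the count. This is a heuristic, not how Word itself numbers lists. The function names are made up for illustration, the labels are always rendered as dotted decimals (the real format lives in word/numbering.xml, which this sketch does not read), and it assumes that any non-list paragraph ends the current list.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in ElementTree's {uri}tag notation
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a .docx archive."""
    with zipfile.ZipFile(docx_path) as zf:
        return zf.read("word/document.xml")


def list_paragraphs(document_xml):
    """Yield (num_id, level, text) per paragraph; num_id is None for non-list paragraphs."""
    root = ET.fromstring(document_xml)
    for p in root.iter(W + "p"):
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        num_pr = p.find(f"{W}pPr/{W}numPr")
        num_id = num_pr.find(W + "numId") if num_pr is not None else None
        # A numId of 0 means numbering was explicitly removed from the paragraph
        if num_id is None or num_id.get(W + "val") == "0":
            yield None, None, text
            continue
        ilvl = num_pr.find(W + "ilvl")
        level = int(ilvl.get(W + "val")) if ilvl is not None else 0
        yield num_id.get(W + "val"), level, text


def number_items(paragraphs):
    """Heuristic: number consecutive list paragraphs by level only, ignoring numId."""
    counters = []
    for num_id, level, text in paragraphs:
        if num_id is None:
            counters = []  # assumption: a non-list paragraph ends the list
            yield "", text
            continue
        counters = counters[: level + 1]  # drop counters of deeper levels
        while len(counters) <= level:
            counters.append(0)
        counters[level] += 1
        yield ".".join(str(c) for c in counters), text
```

Usage would be something like `for label, text in number_items(list_paragraphs(read_document_xml(path))): print(label, text)`. If two genuinely separate lists follow each other with no paragraph in between, this will merge them, so it only fits documents where that does not happen.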
aspose-words seems to extract lists correctly :) Thank you @Daviid!
import os
import aspose.words as aw

doc = aw.Document(FILEPATH)
# Convert to .txt (written to the current working directory)
txt_path = os.path.basename(FILEPATH) + ".txt"
doc.save(txt_path)
# Read the converted text back as a string
with open(txt_path, "r") as f:
    docstr = f.read()
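If only the list items are needed, the converted string can be filtered with a regular expression. The sketch below is a guess at the shape of the text output, not something taken from the aspose-words documentation: it assumes each item sits on its own line and starts with a marker such as "1.", "2.3.", "1.1", "3)" or "a)". The pattern and the function name are illustrative and would need adjusting to the real output.

```python
import re

# Assumed line shape: optional indent, a list marker, whitespace, then the item text
ITEM_RE = re.compile(r"^(\s*)((?:\d+\.)+\d*|\d+\)|[A-Za-z][.)])\s+(.*\S)\s*$")


def extract_numbered_items(text):
    """Return (indent_width, marker, item_text) for every line that looks like a list item."""
    items = []
    # Drop a leading byte-order mark, if the file had one, so the first line can match
    for line in text.lstrip("\ufeff").splitlines():
        m = ITEM_RE.match(line)
        if m:
            indent, marker, body = m.groups()
            items.append((len(indent), marker, body))
    return items
```

Calling `extract_numbered_items(docstr)` would then give a flat list of items. Note that the single-letter alternative also matches ordinary lines that happen to begin with an initial, for example "A. Smith wrote ...", so drop that branch if the document has no lettered lists.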