Search code examples
pythonxmlopenxmlgoogle-docsdocx

Extract DOCX Comments


I'm a teacher. I want a list of all the students who commented on the essay I assigned, and what they said. The Drive API stuff was too challenging for me, but I figured I could download them as a zip and parse the XML.

The comments are tagged in w:comment tags, with w:t for the comment text and . It should be easy, but XML (etree) is killing me.

via the tutorial (and official Python docs):

z = zipfile.ZipFile('test.docx')
x = z.read('word/comments.xml')
tree = etree.XML(x)

Then I do this:

children = tree.getiterator()
for c in children:
    print(c.attrib)

Resulting in this:

{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}author': 'Joe Shmoe', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id': '1', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}date': '2017-11-17T16:58:27Z'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidR': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidDel': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidP': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRDefault': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRPr': '00000000'}
{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}

And after this I am totally stuck. I've tried element.get() and element.findall() with no luck. Even when I copy/paste the value ('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'), I get None in return.

Can anyone help?


Solution

  • You got remarkably far considering that OOXML is such a complex format.

    Here's some sample Python code showing how to access the comments of a DOCX file via XPath:

    from lxml import etree
    import zipfile
    
    ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
    
    def get_comments(docxFileName):
      docxZip = zipfile.ZipFile(docxFileName)
      commentsXML = docxZip.read('word/comments.xml')
      et = etree.XML(commentsXML)
      comments = et.xpath('//w:comment',namespaces=ooXMLns)
      for c in comments:
        # attributes:
        print(c.xpath('@w:author',namespaces=ooXMLns))
        print(c.xpath('@w:date',namespaces=ooXMLns))
        # string value of the comment:
        print(c.xpath('string(.)',namespaces=ooXMLns))