python xpath ms-word docx python-docx

AttributeError with .docx document: 'Document' object has no attribute 'xpath'


from docx import *
document = Document(r'filepath.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
WPML_URI = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main'
tag_rPr = WPML_URI + 'rPr'
tag_highlight = WPML_URI + 'highlight'
tag_val = WPML_URI + 'val'
tag_t = WPML_URI + 't'
for word in words:
    for rPr in word.findall(tag_rPr):
        high = rPr.findall(tag_highlight)
        for hi in high:
            if hi.attribute[tag_val] == 'yellow':
                print(word.find(tag_t).text.encode('utf-8').lower())

In theory, this code should read the document text and then find the text highlighted in yellow. My problem is right at the start: when I run the code as is, I get AttributeError: 'Document' object has no attribute 'xpath'. The problem is apparently with the line words = document.xpath('//w:r', namespaces=document.nsmap), and I don't know how to fix it.


Solution

  • @PirateNinjas is right on. The Document object does not subclass lxml.etree._Element and so does not have the .xpath() method. This is what AttributeError indicates; each method on an object is an attribute (just like an instance variable is) and if one with the name you ask for isn't there, you get this error.

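    However, the object stored at document._element is an instance of an _Element subclass, so it does have .xpath() and may work for you. At least it won't give you this error and should move you further in the right direction. python-docx's element classes override .xpath() to supply the namespace map themselves, so the w: prefix resolves without a namespaces argument. This code should give you all the <w:r> elements in the main story of the document (i.e. the document body, but not headers, footnotes, etc.):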

    rs = document._element.xpath("//w:r")