from docx import *
document = Document(r'filepath.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
WPML_URI = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
tag_rPr = WPML_URI + 'rPr'
tag_highlight = WPML_URI + 'highlight'
tag_val = WPML_URI + 'val'
tag_t = WPML_URI + 't'
for word in words:
    for rPr in word.findall(tag_rPr):
        high = rPr.findall(tag_highlight)
        for hi in high:
            # lxml elements expose their attributes as .attrib, not .attribute
            if hi.attrib[tag_val] == 'yellow':
                print(word.find(tag_t).text.encode('utf-8').lower())
This code should, in theory, get the document text and then find the text highlighted in yellow. My problem is right at the start: when I run the code as is, I get
AttributeError: 'Document' object has no attribute 'xpath'
as the error message. The problem is apparently with
words = document.xpath('//w:r', namespaces=document.nsmap)
and I don't know how to fix it.
@PirateNinjas is right on. The Document object does not subclass lxml.etree._Element and so does not have the .xpath() method. This is what AttributeError indicates: each method on an object is an attribute (just like an instance variable is), and if one with the name you ask for isn't there, you get this error.
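To make the AttributeError point concrete, here is a minimal sketch using two hypothetical stand-in classes (Wrapper and Element are made up for illustration and are not python-docx or lxml classes). It only shows that a method is looked up as an attribute on the object you call it on, not on an object it happens to wrap:

```python
class Element:
    """Hypothetical stand-in for an lxml-style element that has .xpath()."""

    def xpath(self, expr):
        return [expr]


class Wrapper:
    """Hypothetical stand-in for an object that wraps an element without subclassing it."""

    def __init__(self, element):
        self._element = element


wrapper = Wrapper(Element())

print(hasattr(wrapper, "xpath"))           # False
print(hasattr(wrapper._element, "xpath"))  # True

try:
    wrapper.xpath("//w:r")
except AttributeError as exc:
    message = str(exc)
    print(message)  # 'Wrapper' object has no attribute 'xpath'
```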
However, Document._element does subclass _Element and may work for you. At least it won't give you this error and should move you further in the right direction. This code should give you all the <w:r> elements in the main story of the document (i.e. document body, but not headers, footnotes, etc.):
rs = document._element.xpath("//w:r")