python, xml, xpath, lxml, openxml

XPath for w:p without certain ancestor elements?


I have been following an example that uses the XPath not(ancestor::...) predicate, in the form .//x[not(ancestor::w:tbl)], but it is not behaving as I expect.

I am parsing a Word DOCX file that has a table in it, using the Python lxml library to parse it as XML. I want to get the paragraph elements that do not have a table element anywhere among their ancestors.

I type the following in console:

selector = './/w:p[not(ancestor::w:drawing)][not(ancestor::w:tbl)][not(ancestor::v:textbox)][not(ancestor::wps:wsp)][not(ancestor::mc:Fallback)]'
nsDict = {k:v for k,v in doc.nsmap.items() if k}
paragraphs = doc.xpath(selector,namespaces=nsDict)
for p in paragraphs:
    print(bool(p.xpath(".//ancestor::w:tbl",namespaces=nsDict)))
>>>>False
>>>>False
>>>>False
>>>>False
>>>>False
>>>>False
>>>>True
>>>>False

I expected the selector to exclude every paragraph that has a w:tbl ancestor, so the boolean check in the loop should always print False.

How can I amend my initial selector so that no elements are picked up that have w:tbl as ancestors?


Solution

  • Your initial XPath is fine; it is your testing XPath that is faulty.

    Your testing XPath,

    .//ancestor::w:tbl
    

    does not select only the w:tbl ancestors of the current node. Because // is short for /descendant-or-self::node()/, it selects the w:tbl ancestors of the current node and of every one of its descendants.

    For example, a paragraph that has no table ancestors but does contain a descendant table would yield True for your test, because the elements inside that table have a w:tbl ancestor.

    Change it instead to

    ancestor::w:tbl
    

    to select the w:tbl ancestors of the current node.
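
The difference can be reproduced on a small document. The following is a minimal sketch, not taken from the question's file: the XML is a deliberately simplified, made-up structure (not valid WordprocessingML) and the id attributes are hypothetical labels added only so the paragraphs can be told apart.

```python
from lxml import etree

# The WordprocessingML namespace; the document below is a simplified toy
# structure for illustration, not something Word would actually produce.
NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}
XML = b"""<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p id="outer">
    <w:r>
      <w:tbl><w:tr><w:tc><w:p id="inner"/></w:tc></w:tr></w:tbl>
    </w:r>
  </w:p>
  <w:tbl><w:tr><w:tc><w:p id="cell"/></w:tc></w:tr></w:tbl>
</w:body>"""
doc = etree.fromstring(XML)

# The selector from the question, reduced to the w:tbl predicate.
selected = doc.xpath(".//w:p[not(ancestor::w:tbl)]", namespaces=NS)
ids = [p.get("id") for p in selected]

# Faulty test: also looks at the ancestors of every descendant of p.
faulty = [bool(p.xpath(".//ancestor::w:tbl", namespaces=NS)) for p in selected]

# Corrected test: looks only at the ancestors of p itself.
correct = [bool(p.xpath("ancestor::w:tbl", namespaces=NS)) for p in selected]

print(ids, faulty, correct)
```

Only the "outer" paragraph is selected, which is what the original selector should do. The faulty test still reports True for it because the nested table's contents have a w:tbl ancestor, while the corrected test reports False.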