
lxml xpath syntax to access the ancestor of an XML element of specific depth?


I am trying to access ancestors of depth 3 in an XML file, i.e. for element /a/b/c/d/e/f, I want to get element c.

Here is my more realistic example input file:

<?xml version="1.0" encoding="utf-8"?>
<Project xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:QDA-XML:project:1.0">
  <Sources>
    <TextSource name="document example">
      <Description />
      <PlainTextSelection>
        <Description />
        <Coding>
          <CodeRef targetGUID="a2a627dd-f7e7-4fc7-b8db-918e3ad50450" />
        </Coding>
      </PlainTextSelection>
    </TextSource>
    <VideoSource name="myvideo">
      <Transcript>
        <SyncPoint/>
        <SyncPoint/>
        <TranscriptSelection>
          <Description />
          <Coding>
            <CodeRef targetGUID="a2a627dd-f7e7-4fc7-b8db-918e3ad50450" />
          </Coding>
        </TranscriptSelection>
      </Transcript>
      <VideoSelection>
        <Coding>
          <CodeRef targetGUID="a2a627dd-f7e7-4fc7-b8db-918e3ad50450" />
        </Coding>
      </VideoSelection>
    </VideoSource>
  </Sources>
  <Notes>
    <Note name="some text">
      <Description />
      <PlainTextSelection>
        <Description />
        <Coding>
          <CodeRef targetGUID="a2a627dd-f7e7-4fc7-b8db-918e3ad50450" />
        </Coding>
      </PlainTextSelection>
    </Note>
  </Notes>
</Project>

In this case for instance, I want to access the elements Note, TextSource and VideoSource that are ancestors of CodeRef elements.

I have the following working code, but I am wondering whether there is a nicer way to go about it, perhaps using XPath syntax:

import lxml.etree as ET

tree = ET.parse('coderef_examples/project_simplified.xml')
root = tree.getroot()

for i in root.findall('.//CodeRef', root.nsmap):
    p = tree.getelementpath(i)
    p = p.replace('{urn:QDA-XML:project:1.0}', '')
    print('namespace-free path: ', p)

    p = tree.getpath(i) # absolute XPath of this CodeRef, e.g. /*/*[1]/*[1]/...
    s = '/'.join(p.split('/')[:4]) # keep the first three steps: the ancestor at depth 3
    print('xpath string: ', s)
    ancestor = root.xpath(s)[0]
    print('source tag: ', ancestor.tag, ', source name: ', ancestor.get('name'))

Output:

namespace-free path:  Sources/TextSource/PlainTextSelection/Coding/CodeRef
xpath string:  /*/*[1]/*[1]
source tag:  {urn:QDA-XML:project:1.0}TextSource , source name:  document example
namespace-free path:  Sources/VideoSource/Transcript/TranscriptSelection/Coding/CodeRef
xpath string:  /*/*[1]/*[2]
source tag:  {urn:QDA-XML:project:1.0}VideoSource , source name:  myvideo
namespace-free path:  Sources/VideoSource/VideoSelection/Coding/CodeRef
xpath string:  /*/*[1]/*[2]
source tag:  {urn:QDA-XML:project:1.0}VideoSource , source name:  myvideo
namespace-free path:  Notes/Note/PlainTextSelection/Coding/CodeRef
xpath string:  /*/*[2]/*
source tag:  {urn:QDA-XML:project:1.0}Note , source name:  some text

Can it be done directly in XPath, ideally in a way that is independent of the tag of the ancestor and of the depth of the CodeRef elements?

edit: Solution based on Conal Tuohy's answer:

import lxml.etree as ET

tree = ET.parse('coderef_examples/project_simplified.xml')
root = tree.getroot()

for ancestor in root.xpath('/*/*/*[descendant::qda:CodeRef]', namespaces={'qda': 'urn:QDA-XML:project:1.0'}):
    print('source tag: ', ancestor.tag, ', source name: ', ancestor.get('name'))

This is much faster. It prints one entry per matching ancestor rather than one per CodeRef node, but since I only want the unique ancestors, that is even better.
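
If one result per CodeRef is still wanted, as in the original loop, one possible approach is to walk up from each CodeRef with the ancestor axis. An element at depth 3 has exactly two element ancestors, so the predicate count(ancestor::*) = 2 picks it out regardless of how deep the CodeRef sits. This is only a sketch: the inline sample document is a cut-down version of the input above, and the helper name ancestor_at_depth is made up for illustration.

```python
from lxml import etree

# Cut-down sample modelled on the input file above (illustrative only).
XML = b"""<Project xmlns="urn:QDA-XML:project:1.0">
  <Sources>
    <VideoSource name="myvideo">
      <Transcript>
        <TranscriptSelection>
          <Coding><CodeRef targetGUID="a2a627dd-f7e7-4fc7-b8db-918e3ad50450" /></Coding>
        </TranscriptSelection>
      </Transcript>
      <VideoSelection>
        <Coding><CodeRef targetGUID="a2a627dd-f7e7-4fc7-b8db-918e3ad50450" /></Coding>
      </VideoSelection>
    </VideoSource>
  </Sources>
  <Notes>
    <Note name="some text">
      <PlainTextSelection>
        <Coding><CodeRef targetGUID="a2a627dd-f7e7-4fc7-b8db-918e3ad50450" /></Coding>
      </PlainTextSelection>
    </Note>
  </Notes>
</Project>"""

NS = {"qda": "urn:QDA-XML:project:1.0"}


def ancestor_at_depth(element, depth):
    """Return the ancestor of `element` at the given depth (root element = 1).

    Hypothetical helper: an element at depth N has exactly N - 1 element
    ancestors, which is what the predicate tests. Returns None if the
    element itself is not deeper than `depth`.
    """
    matches = element.xpath("ancestor::*[count(ancestor::*) = $n]", n=depth - 1)
    return matches[0] if matches else None


root = etree.fromstring(XML)

results = []
for coderef in root.xpath("//qda:CodeRef", namespaces=NS):
    ancestor = ancestor_at_depth(coderef, 3)
    # QName strips the {namespace} part from the tag for readable output.
    results.append((etree.QName(ancestor).localname, ancestor.get("name")))

for tag, name in results:
    print("source tag:", tag, ", source name:", name)
```

Unlike the single-expression version, this yields the same ancestor several times when it contains several CodeRef elements, so it is only worth using when that per-node pairing matters.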


Solution

  • This XPath will return all elements which are at depth 3:

    /*/*/*
    

    (read as "any element which is the child of an element which is the child of an element which is the child of the document root")

    You mention that you want elements which are ancestors of CodeRef elements. To add that as a filter, you could do this:

    /*/*/*[descendant::qda:CodeRef]
    

    (where qda is a namespace prefix bound to the URI urn:QDA-XML:project:1.0)
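
To make the prefix binding concrete, here is a minimal sketch of how the filtered expression might be run with lxml. The inline document is a shortened stand-in for the question's file, and the extra TextSource named "uncoded" is a made-up element added only to show that depth-3 elements without a CodeRef descendant are skipped. The function name depth3_ancestors is likewise an assumption for illustration.

```python
from lxml import etree

# Shortened stand-in for the question's file; "uncoded" is hypothetical.
XML = b"""<?xml version="1.0" encoding="utf-8"?>
<Project xmlns="urn:QDA-XML:project:1.0">
  <Sources>
    <TextSource name="document example">
      <PlainTextSelection>
        <Coding><CodeRef targetGUID="a2a627dd-f7e7-4fc7-b8db-918e3ad50450" /></Coding>
      </PlainTextSelection>
    </TextSource>
    <VideoSource name="myvideo">
      <VideoSelection>
        <Coding><CodeRef targetGUID="a2a627dd-f7e7-4fc7-b8db-918e3ad50450" /></Coding>
      </VideoSelection>
    </VideoSource>
    <TextSource name="uncoded" />
  </Sources>
  <Notes>
    <Note name="some text">
      <PlainTextSelection>
        <Coding><CodeRef targetGUID="a2a627dd-f7e7-4fc7-b8db-918e3ad50450" /></Coding>
      </PlainTextSelection>
    </Note>
  </Notes>
</Project>"""

# XPath 1.0 has no default namespace, so the document's default
# namespace must be bound to an explicit prefix (here: qda).
NS = {"qda": "urn:QDA-XML:project:1.0"}


def depth3_ancestors(root):
    """Depth-3 elements that contain at least one CodeRef, in document order."""
    return root.xpath("/*/*/*[descendant::qda:CodeRef]", namespaces=NS)


root = etree.fromstring(XML)
names = [element.get("name") for element in depth3_ancestors(root)]
print(names)
```

Each matching element is returned once, however many CodeRef elements it contains, because the predicate only tests whether a descendant exists.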