
Iterate over XML tags and get each element's XPath in Python


I want to iterate over every "p" tag in an XML document and get each element's XPath, but I can't find anything that does this.

The kind of code I tried:

from bs4 import BeautifulSoup

xml_file = open("./data.xml", "rb")
soup = BeautifulSoup(xml_file, "lxml")

for i in soup.find_all("p"):
    print(i.xpath) # prints None: Tag has no xpath attribute, so BeautifulSoup looks for a child <xpath> tag
    print("\n")

Here is a sample XML file that I am trying to parse:

<?xml version="1.0" encoding="UTF-8"?>

<article>
    <title>Sample document</title>
    <body>
        <p>This is a <b>sample document.</b></p>
        <p>And there is another paragraph.</p>
    </body>
</article>

I would like my code to output:

/article/body/p[0]
/article/body/p[1]

Solution

  • Here's how to do it with Python's ElementTree module.

    It uses a simple list to track the iterator's current path through the XML. Whenever you want the XPath for the current element, call gen_xpath() to turn that list into a path string. Same-named siblings get a position index, zero-based to match the output requested in the question (standard XPath positions are 1-based).

    from xml.etree import ElementTree as ET
    
    # A list of elements pushed and popped by the iterator's start and end events
    path = []
    
    
    def gen_xpath():
        '''Build the XPath for the element at the end of `path`.
    
        Start at the root of `path` and, for each step, figure out whether the
        next child is alone or is one of many siblings named the same. If it
        is one of many same-named siblings, determine its position.
    
        Returns the full XPath up to the element the iterator is currently on.
        '''
        full_path = '/' + path[0].tag
    
        for i, parent_elem in enumerate(path[:-1]):
            next_elem = path[i+1]
    
            pos = -1         # acts as counter for all children named the same as next_elem
            next_pos = None  # the position we care about
    
            for child_elem in parent_elem:
                if child_elem.tag == next_elem.tag:
                    pos += 1
    
                # Compare etree.Element identity
                if child_elem is next_elem:
                    next_pos = pos
    
                # Test against None explicitly: position 0 is falsy
                if next_pos is not None and pos > 0:
                    # We know where next_elem is, and that there are many same-named siblings, no need to count others
                    break
    
            # Use next_elem's pos only if there are other same-named siblings
            if pos > 0:
                full_path += f'/{next_elem.tag}[{next_pos}]'
            else:
                full_path += f'/{next_elem.tag}'
    
        return full_path
    
    
    # Iterate the XML
    for event, elem in ET.iterparse('input.xml', ['start', 'end']):
        if event == 'start':
            path.append(elem)
            if elem.tag == 'p':
                print(gen_xpath())
    
        if event == 'end':
            path.pop()
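
One caveat with the 'start' events used above: the ElementTree documentation notes that when iterparse() emits a 'start' event, the element's children may or may not be present yet. On a large file, later same-named siblings might not have been parsed when gen_xpath() counts them. A minimal sketch of one way around that, assuming the document fits in memory, is to parse everything first and then walk the finished tree. The helper name iter_xpaths and the inline SAMPLE string are made up for this illustration; positions stay zero-based to match the question.

```python
from collections import Counter
from xml.etree import ElementTree as ET

# Same content as input.xml, inlined so the sketch is self-contained
SAMPLE = """<article>
    <title>Sample document</title>
    <body>
        <p>This is a <b>sample document.</b></p>
        <p>And there is another paragraph.</p>
        <section>
            <p>Parafoo</p>
        </section>
    </body>
</article>"""


def iter_xpaths(elem, path=None):
    """Yield (path, element) for elem and every descendant, depth first.

    Hypothetical helper: a position is appended only when an element has
    same-named siblings, and it is zero-based as in the question.
    """
    if path is None:
        path = f'/{elem.tag}'
    yield path, elem

    totals = Counter(child.tag for child in elem)  # children per tag name
    seen = Counter()                               # same-named children passed so far
    for child in elem:
        step = child.tag
        if totals[child.tag] > 1:
            step += f'[{seen[child.tag]}]'
        seen[child.tag] += 1
        yield from iter_xpaths(child, f'{path}/{step}')


root = ET.fromstring(SAMPLE)  # the tree is complete before we walk it
p_paths = [path for path, elem in iter_xpaths(root) if elem.tag == 'p']
for path in p_paths:
    print(path)
```

For a file on disk, ET.parse('input.xml').getroot() would take the place of ET.fromstring(SAMPLE). The trade-off is memory: the iterparse version can discard elements as it goes, while this one holds the whole tree.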
    

    When I run that on this modified sample XML, input.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <article>
        <title>Sample document</title>
        <body>
            <p>This is a <b>sample document.</b></p>
            <p>And there is another paragraph.</p>
            <section>
                <p>Parafoo</p>
            </section>
        </body>
    </article>
    

    I get:

    /article/body/p[0]
    /article/body/p[1]
    /article/body/section/p
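
The indexes above are zero-based because that is what the question asked for; XPath itself numbers positions from 1, so these strings will not select the intended elements if fed to an XPath engine as-is. Here is a small sketch of converting them for use with ElementTree's own find(), which supports 1-based tag[position] predicates and expects paths relative to the element it is called on. The helper name to_etree_path and the inline SAMPLE string are assumptions for this example.

```python
import re
from xml.etree import ElementTree as ET

# Same content as input.xml, inlined so the sketch is self-contained
SAMPLE = """<article>
    <title>Sample document</title>
    <body>
        <p>This is a <b>sample document.</b></p>
        <p>And there is another paragraph.</p>
        <section>
            <p>Parafoo</p>
        </section>
    </body>
</article>"""


def to_etree_path(zero_based_path):
    """Convert e.g. '/article/body/p[0]' to './body/p[1]' (hypothetical helper)."""
    # Shift every [n] from zero-based to XPath's one-based numbering
    one_based = re.sub(r'\[(\d+)\]',
                       lambda m: f'[{int(m.group(1)) + 1}]',
                       zero_based_path)
    # Drop the leading root step: Element.find() is relative to the root element
    _, _, rest = one_based.lstrip('/').partition('/')
    return f'./{rest}' if rest else '.'


root = ET.fromstring(SAMPLE)
for path in ['/article/body/p[0]', '/article/body/p[1]', '/article/body/section/p']:
    elem = root.find(to_etree_path(path))
    print(path, '->', to_etree_path(path), '->', repr(elem.text))
```

Each generated path should resolve back to the paragraph it was produced for, which is a quick way to sanity-check the output.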