I want to iterate over every "p" tag in an XML document and get the current element's XPath, but I can't find anything that does it.
The kind of code I tried:
from bs4 import BeautifulSoup

xml_file = open("./data.xml", "rb")
soup = BeautifulSoup(xml_file, "lxml")
for i in soup.find_all("p"):
    print(i.xpath)  # xpath doesn't work here (None)
    print("\n")
Here is a sample XML file that I am trying to parse:
<?xml version="1.0" encoding="UTF-8"?>
<article>
  <title>Sample document</title>
  <body>
    <p>This is a <b>sample document.</b></p>
    <p>And there is another paragraph.</p>
  </body>
</article>
I would like my code to output:
/article/body/p[0]
/article/body/p[1]
Here's how to do it with Python's ElementTree module.
It uses a simple list to track the iterator's current path through the XML. Whenever you want the XPath for an element, call gen_xpath() to turn that list into the XPath for that element, with logic for dealing with same-named siblings (absolute position). The positions are 0-based to match the output asked for in the question; standard XPath counts from 1.
from xml.etree import ElementTree as ET

# A list of elements pushed and popped by the iterator's start and end events
path = []


def gen_xpath():
    '''Start at the root of `path` and figure out if the next child is alone,
    or is one of many siblings named the same. If the next child is one of
    many same-named siblings, determine its position.

    Returns the full XPath of the element the iterator is currently on.
    '''
    full_path = '/' + path[0].tag
    for i, parent_elem in enumerate(path[:-1]):
        next_elem = path[i+1]
        pos = -1  # acts as counter for all children named the same as next_elem
        next_pos = None  # the position we care about
        for child_elem in parent_elem:
            if child_elem.tag == next_elem.tag:
                pos += 1
            # Compare etree.Element identity
            if child_elem is next_elem:
                next_pos = pos
            # next_pos can legitimately be 0, so test against None explicitly
            if next_pos is not None and pos > 0:
                # We know where next_elem is, and that there are many same-named siblings, no need to count others
                break
        # Use next_elem's pos only if there are other same-named siblings
        if pos > 0:
            full_path += f'/{next_elem.tag}[{next_pos}]'
        else:
            full_path += f'/{next_elem.tag}'
    return full_path


# Iterate the XML
for event, elem in ET.iterparse('input.xml', ['start', 'end']):
    if event == 'start':
        path.append(elem)
        if elem.tag == 'p':
            print(gen_xpath())
    if event == 'end':
        path.pop()
When I run that on this modified sample XML, input.xml:
<?xml version="1.0" encoding="UTF-8"?>
<article>
  <title>Sample document</title>
  <body>
    <p>This is a <b>sample document.</b></p>
    <p>And there is another paragraph.</p>
    <section>
      <p>Parafoo</p>
    </section>
  </body>
</article>
I get:
/article/body/p[0]
/article/body/p[1]
/article/body/section/p
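One caveat: as far as I can tell from the documentation, iterparse does not guarantee that an element's children (and therefore its later siblings) have been read yet when a 'start' event fires, so on a large file the sibling count could come out short. If the document fits in memory, a variant that parses everything first avoids that. The following is only a sketch of that idea, not a drop-in replacement; `iter_xpaths`, `first_index` and `SAMPLE` are names I made up for illustration.

```python
from xml.etree import ElementTree as ET

# Hypothetical inline copy of input.xml so the sketch is self-contained
SAMPLE = """<article>
  <title>Sample document</title>
  <body>
    <p>This is a <b>sample document.</b></p>
    <p>And there is another paragraph.</p>
    <section>
      <p>Parafoo</p>
    </section>
  </body>
</article>"""


def iter_xpaths(root, tag, first_index=0):
    """Yield (xpath, element) for every `tag` element under `root`.

    first_index=0 reproduces the 0-based numbering shown above;
    first_index=1 gives standard (1-based) XPath positions.
    """
    def walk(elem, elem_path):
        if elem.tag == tag:
            yield elem_path, elem
        # Count children per tag so an index is added only when
        # there are same-named siblings.
        totals = {}
        for child in elem:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        seen = {}  # how many children with each tag we have passed so far
        for child in elem:
            pos = seen.get(child.tag, 0)
            seen[child.tag] = pos + 1
            step = f"/{child.tag}"
            if totals[child.tag] > 1:
                step += f"[{pos + first_index}]"
            yield from walk(child, elem_path + step)

    yield from walk(root, "/" + root.tag)


root = ET.fromstring(SAMPLE)
for xpath, _ in iter_xpaths(root, "p"):
    print(xpath)
```

With first_index=1 the positions follow standard XPath, so each path can be checked against ElementTree's own lookup, e.g. root.find("./body/p[2]") returns the second paragraph.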