The problem: given a 20 GB XML file with the following structure
<root>
<outer>
<inner prop="x">...</inner>
<inner prop="y">...</inner>
</outer>
<outer>
<inner prop="z">...</inner>
<inner prop="f">...</inner>
</outer>
....
....
</root>
how could one evaluate the XPath count(//outer[inner/@prop="x" and inner/@prop="y"]) efficiently?
I have tried xmllint, pcregrep, xmlstarlet, and xml_grep on Linux, even awk and grep, but the system keeps running out of memory.
I was considering Python's sax module, but I haven't found anything relevant, and I also don't know how such an XPath-style count could work with streaming. It would also be great if SAX could somehow ignore inner text, as the file in question contains several unescaped characters which render the XML not well-formed...
Tough one
For large XML files you can use iterparse() together with clear() to keep memory usage bounded:
from lxml import etree

def count_matching_outer_elements(xml_file):
    count = 0
    # recover=True tells lxml to skip over minor well-formedness errors
    context = etree.iterparse(xml_file, events=('end',), tag='outer', recover=True)
    for _, outer_elem in context:
        # XPath boolean: does this 'outer' element contain 'inner' elements
        # with both prop="x" and prop="y"?
        if outer_elem.xpath('.//inner[@prop="x"] and .//inner[@prop="y"]'):
            count += 1
        # Clear the element and delete already-processed siblings, otherwise
        # the partially built tree keeps growing and memory is exhausted
        outer_elem.clear()
        while outer_elem.getprevious() is not None:
            del outer_elem.getparent()[0]
    return count

xml_file = '20GB.xml'
result = count_matching_outer_elements(xml_file)
print(f"Number of matching <outer> elements: {result}")
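Since the question also asks how such a count could work with SAX: it maps naturally onto SAX events, because you only need to remember the prop values seen inside the current outer element. Here is a minimal sketch using the standard-library xml.sax module (the class and function names OuterCounter and count_with_sax are my own, not from any library):

```python
import xml.sax

class OuterCounter(xml.sax.ContentHandler):
    """Counts <outer> elements whose <inner> children carry
    both prop="x" and prop="y"."""
    def __init__(self):
        super().__init__()
        self.count = 0
        self._props = set()

    def startElement(self, name, attrs):
        if name == 'outer':
            self._props = set()          # reset state per <outer>
        elif name == 'inner':
            prop = attrs.get('prop')
            if prop is not None:
                self._props.add(prop)

    def endElement(self, name):
        if name == 'outer' and {'x', 'y'} <= self._props:
            self.count += 1

def count_with_sax(xml_file):
    handler = OuterCounter()
    xml.sax.parse(xml_file, handler)
    return handler.count
```

Memory use stays constant regardless of file size, since no tree is built, and text content is ignored because characters() is never overridden. Note, however, that a file with unescaped characters will still abort the parse: the underlying expat parser requires well-formed input.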