Tags: python, linux, xml, xpath

XPath on a 20 GB XML file


The problem: given a 20 GB XML file with the following structure:

<root>
 <outer>
   <inner prop="x">...</inner>
   <inner prop="y">...</inner>
 </outer>
 <outer>
   <inner prop="z">...</inner>
   <inner prop="f">...</inner>
 </outer>
....
....
</root>

How could one efficiently evaluate the XPath count(//outer[inner/@prop="x" and inner/@prop="y"])?

I have tried xmllint, pcregrep, xmlstarlet, and xml_grep on Linux, and even awk and grep, but the system keeps running out of memory.

I was considering Python's sax module, but I haven't found anything relevant, and I don't know how an XPath-like count could work with streaming. It would also be great if SAX could somehow ignore inner text, as the file in question contains several unescaped characters that render the XML not well-formed...
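
For concreteness, the kind of streaming count I have in mind would look something like the untested sketch below, using the built-in xml.sax module (the handler and its names are mine, and it still assumes well-formed input, which may not hold here):

    import xml.sax

    class OuterCounter(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.count = 0
            self.props = set()

        def startElement(self, name, attrs):
            if name == 'outer':
                self.props = set()           # fresh set per <outer>
            elif name == 'inner':
                self.props.add(attrs.get('prop'))

        def endElement(self, name):
            # count the <outer> if it had both prop values
            if name == 'outer' and {'x', 'y'} <= self.props:
                self.count += 1

    handler = OuterCounter()
    xml.sax.parse('20GB.xml', handler)
    print(handler.count)

Text content is ignored simply because characters() is never overridden, but the parser itself would still choke on the unescaped characters.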

Tough one


Solution

  • For large XML files you can stream with iterparse() and free memory as you go with clear():

    from lxml import etree

    def count_matching_outer_elements(xml_file):
        count = 0
        # Only materialize <outer> elements, one at a time
        context = etree.iterparse(xml_file, events=('end',), tag='outer')
        for _, outer_elem in context:
            # True if this <outer> has an <inner> child with prop="x"
            # and an <inner> child with prop="y"
            if outer_elem.xpath('inner[@prop="x"] and inner[@prop="y"]'):
                count += 1
            # Discard this element's children, then delete already-
            # processed siblings so memory use stays flat over 20 GB
            outer_elem.clear()
            while outer_elem.getprevious() is not None:
                del outer_elem.getparent()[0]
        return count

    xml_file = '20GB.xml'
    result = count_matching_outer_elements(xml_file)
    print(f"Number of matched <outer> elements: {result}")
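
  • The file being not well-formed is a separate problem: the loop above will abort at the first unescaped character. lxml's iterparse() accepts an lxml-specific recover option that asks libxml2 to push on past parse errors; how much it salvages depends on how broken the input is, so treat the resulting count as approximate and spot-check it:

    # Same call as above, but asking libxml2 to recover from
    # well-formedness errors instead of aborting the parse
    context = etree.iterparse(xml_file, events=('end',), tag='outer',
                              recover=True)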