python xml-parsing lxml xml-namespaces iterparse

Use iterparse and, subsequently, xpath on documents with inconsistent namespace declarations


I need to put together a piece of code that parses a possibly large XML file into custom Python objects. The idea is roughly the following:

from lxml import etree

for event, elem in etree.iterparse(source, tag='Foo'):
    print(elem.xpath('bar/baz')[42])  # there's actually a function call here

The problem is, some of the documents have a namespace declaration, and some don't have any. That means that in the code above, neither the tag='Foo' filter nor the XPath expression will match anything in the namespaced documents.

For now I've been putting up with the ugly

for event, elem in etree.iterparse(source):
    if elem.tag.endswith('Foo'):
        print(elem.xpath('*[local-name()="bar"]/*[local-name()="baz"]')[42])

but this is so awful that I want to get it right even though it works fine. (I suspect it is slower, too.)

Is there a way to write sane code that would account for both cases using iterparse? For now I can only think of catching start-ns and end-ns events and updating a "state-keeping" variable, which I'll have to pass to the function that is called within the loop to do the work. The function will then construct the xpath queries accordingly. This makes some sense, but I'm wondering if there's a simpler way around this.
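For reference, the state-keeping idea described above could look roughly like the sketch below. It is only an illustration: the function name iter_baz_by_ns_events and the sample document are made up, and it assumes the namespace, when present, is declared as the default namespace (xmlns="...") rather than with a prefix.

```python
from io import BytesIO
from lxml import etree


def iter_baz_by_ns_events(source, local_name="Foo"):
    """Yield the texts of bar/baz under each matching element (hypothetical helper)."""
    ns_stack = []  # (prefix, uri) pairs currently in scope, innermost last
    events = ("start-ns", "end-ns", "end")
    for event, payload in etree.iterparse(source, events=events):
        if event == "start-ns":
            ns_stack.append(payload)  # payload is a (prefix, uri) tuple
            continue
        if event == "end-ns":
            ns_stack.pop()  # a declaration went out of scope
            continue
        elem = payload
        # The innermost declaration with an empty prefix is the default namespace.
        default_ns = next(
            (uri for prefix, uri in reversed(ns_stack) if not prefix), None
        )
        if default_ns:
            wanted = "{%s}%s" % (default_ns, local_name)
            path, namespaces = "n:bar/n:baz", {"n": default_ns}
        else:
            wanted = local_name
            path, namespaces = "bar/baz", None
        if elem.tag != wanted:
            continue
        yield [baz.text for baz in elem.xpath(path, namespaces=namespaces)]
        elem.clear()  # free the subtree once the caller is done with it


# Usage with a small in-memory document (sample data, not from the question).
sample = BytesIO(
    b'<root xmlns="urn:example"><Foo><bar><baz>1</baz></bar></Foo></root>'
)
for texts in iter_baz_by_ns_events(sample):
    print(texts)
```

The XPath prefix "n" is arbitrary; XPath 1.0 has no notion of a default namespace, so the URI has to be bound to some prefix before it can be used in the query.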

P.S. I've obviously tried searching around, but haven't found a solution that would work both with and without a namespace. I would also accept a solution that eliminates namespaces from the XML, but only if it doesn't store the whole tree in RAM in the process.
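As a sketch of the namespace-stripping variant mentioned in the P.S.: since iterparse reports children before their parent on "end" events, each element's tag can be rewritten to its local name as it completes, so plain paths work by the time Foo itself arrives. The helper name iter_foo_stripped and the sample data are hypothetical, and namespaced attributes are left untouched here.

```python
from io import BytesIO
from lxml import etree


def iter_foo_stripped(source, local_name="Foo"):
    """Yield matching elements whose subtree has namespace-free tags (hypothetical helper)."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        # Replace '{uri}name' with 'name'; tags without a namespace are unchanged.
        elem.tag = etree.QName(elem).localname
        if elem.tag == local_name:
            yield elem
            elem.clear()  # drop the subtree after the caller has used it


# Usage with a small in-memory document (sample data, not from the question).
sample = BytesIO(
    b'<root xmlns="urn:example"><Foo><bar><baz>1</baz></bar></Foo></root>'
)
for foo in iter_foo_stripped(sample):
    print([baz.text for baz in foo.xpath("bar/baz")])
```

Only one Foo subtree is fully populated at a time, so memory use should stay bounded as long as the Foo elements themselves are small. The root element still accumulates the emptied Foo shells, which may matter for very large files.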


Solution

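  • Every lxml element has an .nsmap attribute, a dict mapping prefixes to namespace URIs, with the default namespace stored under the key None. Use it to detect whether your namespace is in effect and branch accordingly.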
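
A minimal sketch of how that branching might look, assuming the namespaced documents declare the namespace as the default one (so it appears under the key None in .nsmap). The function name iter_baz_by_nsmap and the sample document are invented for illustration.

```python
from io import BytesIO
from lxml import etree


def iter_baz_by_nsmap(source, local_name="Foo"):
    """Yield the texts of bar/baz under each matching element (hypothetical helper)."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if etree.QName(elem).localname != local_name:
            continue
        # nsmap includes inherited declarations; None is the default namespace.
        default_ns = elem.nsmap.get(None)
        if default_ns:
            # XPath needs an explicit prefix, so bind the URI to one.
            path, namespaces = "n:bar/n:baz", {"n": default_ns}
        else:
            path, namespaces = "bar/baz", None
        yield [baz.text for baz in elem.xpath(path, namespaces=namespaces)]
        elem.clear()  # free the subtree once the caller is done with it


# Usage with a small in-memory document (sample data, not from the question).
sample = BytesIO(
    b'<root xmlns="urn:example"><Foo><bar><baz>1</baz></bar></Foo></root>'
)
for texts in iter_baz_by_nsmap(sample):
    print(texts)
```

If the documents used a prefixed declaration instead (xmlns:x="..."), elem.nsmap.get(None) would return None, and etree.QName(elem).namespace would be the more reliable way to find the element's own namespace.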