
Is there a way to skip nodes/elements with iterparse lxml?


Is there a way, using lxml's iterparse, to skip an element without checking its tag? Take this XML for example:

<root>
    <sample>
        <tag1>text1</tag1>
        <tag2>text2</tag2>
        <tag3>text3</tag3>
        <tag4>text4</tag4>
    </sample>
    <sample>
        <tag1>text1</tag1>
        <tag2>text2</tag2>
        <tag3>text3</tag3>
        <tag4>text4</tag4>
    </sample>
</root>
    

I only care about tag1 and tag4, so checking tag2 and tag3 wastes time. If the file is small, it doesn't really matter, but with a million <sample> nodes I could cut the processing time if I didn't have to check tag2 and tag3. They're always there and I never need them.

Using iterparse in lxml:

from lxml import etree

xmlfile = 'myfile.xml'
my_list = []

# Yield each <sample> element once it has been fully parsed
context = etree.iterparse(xmlfile, events=('end',), tag='sample')

for event, elem in context:
    for child in elem:
        if child.tag == 'tag1':
            my_list.append(child.text)

            # HERE I'd like to advance the loop twice without checking tag2 and tag3 at all
            # something like:

            # next(child)
            # next(child)

        elif child.tag == 'tag4':
            my_list.append(child.text)
    

Solution

  • If you pass the tag argument to iterchildren, just as you do with iterparse, lxml filters the children for you, so your loop only ever sees tag1 and tag4 and "skips" everything else.

    Example...

    from lxml import etree
    
    xmlfile = "myfile.xml"
    
    my_list = []
    
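    # events defaults to ("end",), so each <sample> arrives fully parsed
    for event, elem in etree.iterparse(xmlfile, tag="sample"):
        # Only tag1 and tag4 children are yielded; tag2 and tag3 never reach the loop
        for child in elem.iterchildren(tag=["tag1", "tag4"]):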
            if child.tag == "tag1":
                my_list.append(child.text)
            elif child.tag == "tag4":
                my_list.append(child.text)
    
    print(my_list)
    

    Printed output...

    ['text1', 'text4', 'text1', 'text4']
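
Since the question mentions files with a million <sample> nodes, here is a self-contained sketch of the same idea that also frees each element after use. The in-memory XML, the `collect_texts` helper and its `wanted` parameter are made up for illustration and are not part of the original answer. Note that lxml still parses tag2 and tag3; the tag filter only keeps them out of your Python loop.

```python
import io

from lxml import etree

# Hypothetical in-memory sample standing in for myfile.xml.
xml_bytes = b"""<root>
    <sample><tag1>a1</tag1><tag2>a2</tag2><tag3>a3</tag3><tag4>a4</tag4></sample>
    <sample><tag1>b1</tag1><tag2>b2</tag2><tag3>b3</tag3><tag4>b4</tag4></sample>
</root>"""


def collect_texts(source, wanted=("tag1", "tag4")):
    """Collect the text of the wanted children of every <sample> element."""
    texts = []
    # events defaults to ("end",), so each <sample> arrives fully parsed.
    for _event, elem in etree.iterparse(source, tag="sample"):
        # Passing the tag names positionally lets lxml do the filtering.
        texts.extend(child.text for child in elem.iterchildren(*wanted))
        # Drop the processed subtree so it does not pile up in memory.
        elem.clear()
        # Also remove earlier siblings that the root still references.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return texts


result = collect_texts(io.BytesIO(xml_bytes))
print(result)
```

Because both branches in the answer append `child.text`, the if/elif collapses into a single `extend` here. If tag1 and tag4 need different handling, keep the branches and check `child.tag` inside the filtered loop.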