How can I know the parent of an element when using the iterparse method of cElementTree?


I want to loop through the elements of an XML file and yield every element unless its parent is a feature.

So in pseudocode

    for event, element in cElementTree.iterparse('../test.xml'):
        if parentOf_element != 'feature':
            yield element

How can I get the parent of the element? I know it's possible with the tree.getiterator() function, but I don't want to build the full tree because the XML files are a few gigabytes in size.


Solution

  • If you enable start events, you can track ancestor nodes by using a stack. If you really mean to suppress all descendants of a <feature>, instead of just children, you can use a simple flag as demonstrated in another answer.

    You can call root.clear() after each end event to blow away all finished-with elements.

    Code:

    import xml.etree.cElementTree as et
    # Produces identical answers with import lxml.etree as et
    import cStringIO
    
    def normtext(t):
        return repr("" if t is None else t.strip())
    
    def dump(el):
        print el.tag, normtext(el.text), normtext(el.tail), el.attrib
    
    def my_filtered_elements(source, skip_parent_tag="feature"):
        # get an iterable
        context = et.iterparse(source, events=("start", "end"))
        # turn it into an iterator
        context = iter(context)
        # get the root element
        event, root = context.next()
        tag_stack = [None, root.tag]
        for event, elem in context:
            # print event, elem.tag, tag_stack
            if event == "start":
                tag_stack.append(elem.tag)
            else:
                assert event == "end"
                my_tag = tag_stack.pop()
                assert my_tag == elem.tag
                parent_tag = tag_stack[-1]
                if parent_tag is not None and parent_tag != skip_parent_tag:
                    dump(elem)
                    # yield elem
                root.clear()
    
    def other_filtered_elements(source, skip_parent_tag="feature"):            
        in_feature_tag = False
        for event, element in et.iterparse(source, events=('start', 'end')):
            if element.tag == skip_parent_tag:
                in_feature_tag = event == 'start'
            if event == 'end' and not in_feature_tag:
                dump(element)            
    
    test_input = """
    <top>
        <lev1 guff="1111">
            <lev2>aaaaa</lev2>
            <lev2>bbbbb</lev2>
        </lev1>
        <feature>
            feat text 1
            <fchild>fcfcfcfc
                <fgchild>ggggg</fgchild>    
            </fchild>
            feat text 2
        </feature>
        <lev1 guff="2222">
            <lev2>ccccc</lev2>c-tail
            <lev2>ddddd</lev2>d-tail
            <notext1></notext1>e-tail
            <notext2 />f-tail
         </lev1>g-tail
    </top>
    """
    
    print "=== me ==="
    my_filtered_elements(cStringIO.StringIO(test_input))
    print "=== other ==="
    other_filtered_elements(cStringIO.StringIO(test_input))
    

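    Output is below. You'll notice from the lev1 nodes that root.clear() doesn't blow away elements that haven't been fully parsed yet: an open lev1 keeps its own children until its end tag is seen. Only finished children of the root are discarded, so the amount of memory used is bounded by the largest subtree directly under the root, not by the total number of elements in the tree.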

    === me ===
    lev2 'aaaaa' '' {}
    lev2 'bbbbb' '' {}
    lev1 '' '' {'guff': '1111'}
    fgchild 'ggggg' '' {}          <<<=== do you want this?
    feature 'feat text 1' '' {}
    lev2 'ccccc' 'c-tail' {}
    lev2 'ddddd' 'd-tail' {}
    notext1 '' 'e-tail' {}
    notext2 '' 'f-tail' {}
    lev1 '' 'g-tail' {'guff': '2222'}
    === other ===
    lev2 'aaaaa' '' {}
    lev2 'bbbbb' '' {}
    lev1 '' '' {'guff': '1111'}
    feature 'feat text 1' '' {}
    lev2 'ccccc' 'c-tail' {}
    lev2 'ddddd' 'd-tail' {}
    notext1 '' 'e-tail' {}
    notext2 '' 'f-tail' {}
    lev1 '' 'g-tail' {'guff': '2222'}
    top '' '' {}                           <<<=== do you want this?
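
    The code above targets Python 2 (print statements, cStringIO, context.next()). As a rough sketch of the same stack idea for Python 3, assuming xml.etree.ElementTree in place of cElementTree, the generator below yields elements instead of dumping them. The function name iter_elements_skipping_children_of is made up for illustration and is not part of any library.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements_skipping_children_of(source, skip_parent_tag="feature"):
    """Yield each finished element whose direct parent is not skip_parent_tag.

    The root element is never yielded because it has no parent.
    (Hypothetical helper name, chosen for this sketch.)
    """
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)        # the first event is the root's "start"
    tag_stack = [root.tag]         # tags of the currently open elements
    for event, elem in context:
        if event == "start":
            tag_stack.append(elem.tag)
            continue
        tag_stack.pop()            # drop elem's own tag; the parent is now on top
        if not tag_stack:          # elem is the root itself
            break
        if tag_stack[-1] != skip_parent_tag:
            yield elem
        # Runs once the consumer asks for the next element:
        # discard the finished children of the root.
        root.clear()


if __name__ == "__main__":
    sample = b"""<top>
        <lev1><lev2>aaaaa</lev2></lev1>
        <feature><fchild><fgchild>ggggg</fgchild></fchild></feature>
    </top>"""
    for el in iter_elements_skipping_children_of(io.BytesIO(sample)):
        print(el.tag, el.attrib)
```

    As in the first function above, only direct children of a feature are suppressed, so fgchild is still yielded. Because root.clear() runs only when the generator is resumed, each yielded element should be processed before the next one is requested.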