How can I find the parent of an element when using the iterparse method of cElementTree?

I want to loop through the elements of an XML file and yield every element, unless its parent is a feature element.

So, in pseudocode:

    for event, element in cElementTree.iterparse('../test.xml'):
        if parentOf_element != 'feature':
            yield element

How can I get the parent of the element? I know it's possible with the tree.getiterator() function, but I don't want to build the full tree because the XML files are several gigabytes in size.

Solution

If you enable start events, you can track ancestor nodes by using a stack. If you really mean to suppress all descendants of a <feature>, instead of just children, you can use a simple flag as demonstrated in another answer.

You can use root.clear() to blow away all finished-with elements. Read this.
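As a rough illustration of what root.clear() buys you, the sketch below counts how many children are still attached to the root once the parse loop has finished, with and without clearing. It is written for Python 3 with xml.etree.ElementTree (not the Python 2 cElementTree used in the rest of this answer), and the helper name children_left_on_root is made up for this example.

```python
import io
import xml.etree.ElementTree as ET

def children_left_on_root(xml_text, clear):
    """Parse xml_text incrementally; return how many children the root still holds afterwards."""
    context = iter(ET.iterparse(io.StringIO(xml_text), events=("start", "end")))
    _, root = next(context)      # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and clear:
            root.clear()         # detach everything currently hanging off the root
    return len(root)

doc = "<top>" + "<item>x</item>" * 1000 + "</top>"
print(children_left_on_root(doc, clear=False))  # 1000
print(children_left_on_root(doc, clear=True))   # 0
```

Without the clear() call, every finished element stays reachable from the root, which is exactly what you want to avoid with multi-gigabyte files.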

Code:

import xml.etree.cElementTree as et
# Produces identical answers with import lxml.etree as et
import cStringIO

def normtext(t):
    return repr("" if t is None else t.strip())

def dump(el):
    print el.tag, normtext(el.text), normtext(el.tail), el.attrib

def my_filtered_elements(source, skip_parent_tag="feature"):
    # get an iterable
    context = et.iterparse(source, events=("start", "end"))
    # turn it into an iterator
    context = iter(context)
    # get the root element
    event, root = context.next()
    tag_stack = [None, root.tag]
    for event, elem in context:
        # print event, elem.tag, tag_stack
        if event == "start":
            tag_stack.append(elem.tag)
        else:
            assert event == "end"
            my_tag = tag_stack.pop()
            assert my_tag == elem.tag
            parent_tag = tag_stack[-1]
            if parent_tag is not None and parent_tag != skip_parent_tag:
                dump(elem)
                # yield elem
            root.clear()

def other_filtered_elements(source, skip_parent_tag="feature"):
    in_feature_tag = False
    for event, element in et.iterparse(source, events=('start', 'end')):
        if element.tag == skip_parent_tag:
            in_feature_tag = event == 'start'
        if event == 'end' and not in_feature_tag:
            dump(element)

test_input = """
<top>
    <lev1 guff="1111">
        <lev2>aaaaa</lev2>
        <lev2>bbbbb</lev2>
    </lev1>
    <feature>
        feat text 1
        <fchild>fcfcfcfc
            <fgchild>ggggg</fgchild>    
        </fchild>
        feat text 2
    </feature>
    <lev1 guff="2222">
        <lev2>ccccc</lev2>c-tail
        <lev2>ddddd</lev2>d-tail
        <notext1></notext1>e-tail
        <notext2 />f-tail
     </lev1>g-tail
</top>
"""

print "=== me ==="
my_filtered_elements(cStringIO.StringIO(test_input))
print "=== other ==="
other_filtered_elements(cStringIO.StringIO(test_input))

The output is below. You'll notice from the lev1 nodes that root.clear() doesn't blow away elements that haven't been fully parsed yet. This means that the amount of memory used is O(depth of tree), not O(total number of elements in the tree).

=== me ===
lev2 'aaaaa' '' {}
lev2 'bbbbb' '' {}
lev1 '' '' {'guff': '1111'}
fgchild 'ggggg' '' {}          <<<=== do you want this?
feature 'feat text 1' '' {}
lev2 'ccccc' 'c-tail' {}
lev2 'ddddd' 'd-tail' {}
notext1 '' 'e-tail' {}
notext2 '' 'f-tail' {}
lev1 '' 'g-tail' {'guff': '2222'}
=== other ===
lev2 'aaaaa' '' {}
lev2 'bbbbb' '' {}
lev1 '' '' {'guff': '1111'}
feature 'feat text 1' '' {}
lev2 'ccccc' 'c-tail' {}
lev2 'ddddd' 'd-tail' {}
notext1 '' 'e-tail' {}
notext2 '' 'f-tail' {}
lev1 '' 'g-tail' {'guff': '2222'}
top '' '' {}                           <<<=== do you want this?