I want to loop through the elements of an XML file and yield every element, unless its parent is a feature.
So, in pseudocode:
for event, element in cElementTree.iterparse('../test.xml'):
    if parentOf_element != 'feature':
        yield element
How can I get the parent of the element? I know it's possible with the tree.getiterator() function, but I don't want to build the full tree because the XML files are a few gigabytes in size.
If you enable start events, you can track ancestor nodes by using a stack. If you really mean to suppress all descendants of a <feature>, instead of just children, you can use a simple flag as demonstrated in another answer.
You can use root.clear() to blow away all finished-with elements. Read this.
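To get a rough feel for what root.clear() buys you, here is a small sketch. It assumes Python 3's xml.etree.ElementTree (the code in this answer is Python 2), and max_children_seen is a made-up helper name, not part of any library. It records the largest number of direct children the root holds at any end event, with and without clearing. The parser reads its input in chunks, so even with clearing the count is not exactly one; it stays bounded by the read-ahead instead of growing with the file.

```python
import io
import xml.etree.ElementTree as ET


def max_children_seen(xml_bytes, clear):
    """Return the largest number of direct children the root held at any end event."""
    context = iter(ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")))
    _, root = next(context)      # the first event is the root's "start"
    peak = 0
    for event, elem in context:
        if event == "end":
            peak = max(peak, len(root))
            if clear:
                root.clear()     # drop the root's references to its children
    return peak


# A synthetic document with 20000 small children under one root.
doc = b"<top>" + b"<item>x</item>" * 20000 + b"</top>"

print(max_children_seen(doc, clear=False))  # 20000
print(max_children_seen(doc, clear=True))   # far smaller, limited by the parser's read-ahead
```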
Code:
import xml.etree.cElementTree as et
# Produces identical answers with import lxml.etree as et
import cStringIO
def normtext(t):
    return repr("" if t is None else t.strip())
def dump(el):
    print el.tag, normtext(el.text), normtext(el.tail), el.attrib
def my_filtered_elements(source, skip_parent_tag="feature"):
    # get an iterable
    context = et.iterparse(source, events=("start", "end"))
    # turn it into an iterator
    context = iter(context)
    # get the root element
    event, root = context.next()
    tag_stack = [None, root.tag]
    for event, elem in context:
        # print event, elem.tag, tag_stack
        if event == "start":
            tag_stack.append(elem.tag)
        else:
            assert event == "end"
            my_tag = tag_stack.pop()
            assert my_tag == elem.tag
            parent_tag = tag_stack[-1]
            if parent_tag is not None and parent_tag != skip_parent_tag:
                dump(elem)
                # yield elem
            root.clear()
def other_filtered_elements(source, skip_parent_tag="feature"):
    in_feature_tag = False
    for event, element in et.iterparse(source, events=('start', 'end')):
        if element.tag == skip_parent_tag:
            in_feature_tag = event == 'start'
        if event == 'end' and not in_feature_tag:
            dump(element)
test_input = """
<top>
    <lev1 guff="1111">
        <lev2>aaaaa</lev2>
        <lev2>bbbbb</lev2>
    </lev1>
    <feature>
        feat text 1
        <fchild>fcfcfcfc
            <fgchild>ggggg</fgchild>
        </fchild>
        feat text 2
    </feature>
    <lev1 guff="2222">
        <lev2>ccccc</lev2>c-tail
        <lev2>ddddd</lev2>d-tail
        <notext1></notext1>e-tail
        <notext2 />f-tail
    </lev1>g-tail
</top>
"""
print "=== me ==="
my_filtered_elements(cStringIO.StringIO(test_input))
print "=== other ==="
other_filtered_elements(cStringIO.StringIO(test_input))
Output is below. You'll notice from the lev1 nodes that root.clear() doesn't blow away elements that haven't been fully parsed yet. This means that the amount of memory used is O(depth of tree), not O(total number of elements in the tree).
=== me ===
lev2 'aaaaa' '' {}
lev2 'bbbbb' '' {}
lev1 '' '' {'guff': '1111'}
fgchild 'ggggg' '' {} <<<=== do you want this?
feature 'feat text 1' '' {}
lev2 'ccccc' 'c-tail' {}
lev2 'ddddd' 'd-tail' {}
notext1 '' 'e-tail' {}
notext2 '' 'f-tail' {}
lev1 '' 'g-tail' {'guff': '2222'}
=== other ===
lev2 'aaaaa' '' {}
lev2 'bbbbb' '' {}
lev1 '' '' {'guff': '1111'}
feature 'feat text 1' '' {}
lev2 'ccccc' 'c-tail' {}
lev2 'ddddd' 'd-tail' {}
notext1 '' 'e-tail' {}
notext2 '' 'f-tail' {}
lev1 '' 'g-tail' {'guff': '2222'}
top '' '' {} <<<=== do you want this?
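The two "do you want this?" cases come down to whether you filter on the parent only or on any ancestor. As a sketch of the stack approach written as an actual generator, assuming Python 3's xml.etree.ElementTree rather than the Python 2 modules used above: iter_with_parent and its skip_descendants parameter are hypothetical names introduced here for illustration. The root itself is never yielded, and each yielded item is paired with its parent's tag.

```python
import io
import xml.etree.ElementTree as ET


def iter_with_parent(source, skip_parent_tag="feature", skip_descendants=False):
    """Yield (element, parent_tag) on each "end" event, filtered by ancestry."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)          # the root's "start" event
    tag_stack = [root.tag]           # tags of the elements that are currently open
    for event, elem in context:
        if event == "start":
            tag_stack.append(elem.tag)
            continue
        tag_stack.pop()              # elem is closed; the stack now holds its ancestors
        if not tag_stack:
            break                    # that was the root's own "end" event
        if skip_descendants:
            blocked = skip_parent_tag in tag_stack        # any ancestor matches
        else:
            blocked = tag_stack[-1] == skip_parent_tag    # direct parent matches
        if not blocked:
            yield elem, tag_stack[-1]
        root.clear()                 # after a yield, this runs when the next item is requested


sample = (b"<top><lev1><lev2>a</lev2></lev1>"
          b"<feature>t<fchild><fgchild>g</fgchild></fchild></feature></top>")

children_only = [e.tag for e, _ in iter_with_parent(io.BytesIO(sample))]
print(children_only)   # ['lev2', 'lev1', 'fgchild', 'feature']

whole_subtree = [e.tag for e, _ in
                 iter_with_parent(io.BytesIO(sample), skip_descendants=True)]
print(whole_subtree)   # ['lev2', 'lev1', 'feature']
```

Because root.clear() sits after the yield, an element stays attached to the tree while the caller is working with it and is only released once the caller asks for the next one.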