I'm new to xml and python and I hope that I phrased my problem right:
I have xml files with a size of one gigabyte. The files look like this:
<test name="LongTestname" result="PASS">
<step ID="0" step="NameOfStep1" result="PASS">
Stuff I dont't care about
</step>
<step ID="1" step="NameOfStep2" result="PASS">
Stuff I dont't care about
</step>
</test>
For fast analysis I want to get the name and the result of the steps which are the children of the root element. Stuff I dont't care about are lots of nested elements.
I have already tried following:
tree = ET.parse(xmlLocation)
root = tree.getroot()
for child in root:
print(child.tag, child.attrib)
Here I get a memory error because the files are to big
Then I tried:
try:
for event, elem in ET.iterparse(pathToSteps, events=("start","end")):
if elem.tag == "step" and event == "start":
stepAndResult.append([elem.attrib['step'],elem.attrib['result'],"System1"])
elem.clear()
This works but is really slow. I guess it iterates through all elements and this takes a very long time.
Then I found a solution looking like this:
try:
tree = ET.iterparse(pathToSteps, events=("start","end"))
_, root = next(tree)
print('ROOT:', root.tag)
except:
print("ERROR: Unable to open and parse file !!!")
for child in root:
print(child.attrib)
But this prints only the attributes of the first step.
Is there a way to speed up the working solution? Since I'm pretty new to this stuff I would appreciate a complete example or a reference where I can figure it out by myself with an example.
I think you're on the right track with iterparse()
.
Maybe try specifying the step
element name in the tag
argument and only processing "start" events...
from lxml import etree
for event, elem in etree.iterparse("input.xml", tag="step", events=("start",)):
print(elem.attrib)
elem.clear()
EDIT: For some reason I thought you were using lxml and not ElementTree. My answer would require you to switch to lxml.