In an answer to another question, someone pointed me to the following tutorial, in which the author claims to have used iterparse to parse a ~100 MB XML file in under 3 seconds:
http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/
I am trying to parse an ~90 MB XML file, and I have the following code:
from xml.etree.cElementTree import *

count = 0
for event, elem in iterparse('foo.xml'):
    if elem.tag == 'identifier' and elem.text == 'bar':
        count += 1
    elem.clear()  # discard the element
print count
It is taking about thirty seconds, which is not even the same order of magnitude as the time reported in the tutorial, even though I am using a similarly sized file, a similar algorithm, and the same package.
Could someone please inform me what might be wrong with my code, or what differences I might not be noticing between my situation and the tutorial?
I am using Python 2.7.3.
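For reference, here is a minimal sketch of how the timing can be reproduced; the wall-clock wrapper is added only for illustration, everything else is the snippet above:

import time
from xml.etree.cElementTree import iterparse

start = time.time()
count = 0
for event, elem in iterparse('foo.xml'):
    if elem.tag == 'identifier' and elem.text == 'bar':
        count += 1
    elem.clear()  # discard the element to keep memory usage flat
print count
print 'elapsed: %.1f seconds' % (time.time() - start)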
Addendum:
I am also using a reasonably powerful machine, in case anyone thinks that might be it.
As TJD mentioned, comparing XML files by size alone may not be very informative. However, I happen to have files of the same structure but different size:
With a 79M file:
$ python -m timeit -n 1 -c 'from xml.etree.cElementTree import iterparse
count = 0
for event, elem in iterparse("..../QT20060217_S_18mix23-2500_01.mzML"):
    if elem.tag.endswith("spectrum"): count += 1
    elem.clear()
print count'
6126
6126
6126
1 loops, best of 3: 950 msec per loop
With a 3.8G file the timeit output is:
1 loops, best of 3: 22.3 sec per loop
Also, compare with lxml: changing xml.etree.cElementTree in the first line to lxml.etree (sketched below, after the timings), I get:
for the first file: 730 msec per loop
for the second file: 11.4 sec per loop
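For concreteness, the lxml version is the same loop with only the import changed; this is a sketch, with the path abbreviated exactly as in the timings above:

from lxml.etree import iterparse

count = 0
for event, elem in iterparse("..../QT20060217_S_18mix23-2500_01.mzML"):
    if elem.tag.endswith("spectrum"):
        count += 1
    elem.clear()  # free the element after inspection
print count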