python xml iterparse celementtree

How long should ElementTree iterparse take?


In answering another question, someone showed me the following tutorial, in which the author claims to have used iterparse to parse a ~100 MB XML file in under 3 seconds:

http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/

I am trying to parse an ~90 MB XML file, and I have the following code:

from xml.etree.cElementTree import iterparse

count = 0

for event, elem in iterparse('foo.xml'):  # yields "end" events by default
    if elem.tag == 'identifier' and elem.text == 'bar':
        count += 1
    elem.clear()  # discard the element's contents to free memory

print count

It takes about thirty seconds, an order of magnitude slower than the time reported in the tutorial for a similarly sized file, a similar algorithm, and the same package.

Could someone please inform me what might be wrong with my code, or what differences I might not be noticing between my situation and the tutorial?

I am using Python 2.7.3.

Addendum:

I am also using a reasonably powerful machine, in case anyone thinks that might be it.


Solution

  • As TJD mentioned, comparing XML files by size alone may not be very informative. However, I happen to have files with the same structure but different sizes:

    With a 79M file:

    $ python -m timeit -n 1 -c 'from xml.etree.cElementTree import iterparse
    count = 0
    for event, elem in iterparse("..../QT20060217_S_18mix23-2500_01.mzML"):
        if elem.tag.endswith("spectrum"): count += 1
        elem.clear()
    print count'
    6126
    6126
    6126
    1 loops, best of 3: 950 msec per loop
    

    With a 3.8G file the timeit output is:

    1 loops, best of 3: 22.3 sec per loop
    

    For comparison with lxml: after changing xml.etree.cElementTree in the first line to lxml.etree, I get:

    for the first file: 730 msec per loop

    for the second file: 11.4 sec per loop
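
To run the same kind of measurement on your own data, something like the following sketch may help. It is written for Python 3, where xml.etree.ElementTree picks up the C accelerator automatically (on Python 2.7 you would import xml.etree.cElementTree instead). The helper name count_tags and the small in-memory sample document are made up for illustration; replace the BytesIO object with the path to your file.

```python
import io
import time
import xml.etree.ElementTree as ET  # on Python 2.7: xml.etree.cElementTree


def count_tags(source, suffix, iterparse=ET.iterparse):
    """Count elements whose tag ends with `suffix` (hypothetical helper)."""
    count = 0
    for event, elem in iterparse(source):  # only "end" events by default
        if elem.tag.endswith(suffix):
            count += 1
        elem.clear()  # drop children, text and attributes of this element
    return count


# Hypothetical sample standing in for a real file such as an mzML document.
sample = b"<run>" + b"<spectrum><peak>1</peak></spectrum>" * 1000 + b"</run>"

start = time.perf_counter()
n = count_tags(io.BytesIO(sample), "spectrum")
elapsed = time.perf_counter() - start
print("%d matching elements in %.4f s" % (n, elapsed))
```

If lxml is installed, passing iterparse=lxml.etree.iterparse as the third argument should let you time both parsers with the same loop. Timings will depend on the structure of the file as well as its size, so compare against files shaped like your own.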