python xml lxml large-files elementtree

Using Python Iterparse For Large XML Files


I need to write a parser in Python that can process some extremely large files (> 2 GB) on a computer without much memory (only 2 GB of RAM). I wanted to use iterparse in lxml to do it.

My file is of the format:

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>

and so far my solution is:

from lxml import etree

context = etree.iterparse( MYFILE, tag='item' )

for event, elem in context:
    # The sample file names this element <desc>, not <description>
    print(elem.xpath('desc/text()'))

del context

Unfortunately, this solution still eats up a lot of memory. I think the problem is that after dealing with each "item" I need to do something to clean up the now-empty children. Can anyone suggest what I should do after processing each element to clean up properly?


Solution

  • Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.

    from lxml import etree


    def fast_iter(context, func, *args, **kwargs):
        """
        http://lxml.de/parsing.html#modifying-the-tree
        Based on Liza Daly's fast_iter
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        del context
    
    
    def process_element(elem):
        # The sample file names this element <desc>, not <description>
        print(elem.xpath('desc/text()'))

    context = etree.iterparse(MYFILE, tag='item')
    fast_iter(context, process_element)
    

    Daly's article is an excellent read, especially if you are processing large XML files.
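For a quick end-to-end check, here is a minimal Python 3 sketch of the same idea. It assumes the items are wrapped in a single root element, called `items` here purely for illustration (the question's snippet shows no root, but a well-formed XML document needs one). The helper `collect_descriptions` is also made up for this example: it gathers the `desc` text into a list instead of printing, and it parses from an in-memory `io.BytesIO` so it runs without a file on disk.

```python
import io

from lxml import etree

# Hypothetical sample data: <items> is an assumed root element.
SAMPLE = b"""<items>
<item><title>Item 1</title><desc>Description 1</desc></item>
<item><title>Item 2</title><desc>Description 2</desc></item>
<item><title>Item 3</title><desc>Description 3</desc></item>
</items>"""


def fast_iter(context, func, *args, **kwargs):
    """Call func on each element, then free it and everything before it."""
    for event, elem in context:
        func(elem, *args, **kwargs)
        # Drop the element's children, text and attributes.
        elem.clear()
        # Remove already-processed siblings of elem and of its ancestors.
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def collect_descriptions(source):
    """Return the text of every <desc> found inside an <item>."""
    descriptions = []
    context = etree.iterparse(source, events=('end',), tag='item')
    fast_iter(context, lambda elem: descriptions.append(elem.findtext('desc')))
    return descriptions


descriptions = collect_descriptions(io.BytesIO(SAMPLE))
print(descriptions)
```

For a real file, pass the file name (or an open binary file object) to `collect_descriptions` instead of the `BytesIO` wrapper. Note that the callback must take whatever it needs from the element before returning, because `fast_iter` clears the element immediately afterwards.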


    Edit: The fast_iter posted above is a modified version of Daly's fast_iter. After processing an element, it is more aggressive at removing other elements that are no longer needed.

    The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while the mod_fast_iter does delete it, thus saving more memory.

    import lxml.etree as ET
    import textwrap
    import io
    
    def setup_ABC():
        content = textwrap.dedent('''\
          <root>
            <A1>
              <B1></B1>
              <C>1<D1></D1></C>
              <E1></E1>
            </A1>
            <A2>
              <B2></B2>
              <C>2<D></D></C>
              <E2></E2>
            </A2>
          </root>
            ''')
        # io.BytesIO needs bytes, so encode the text (required on Python 3)
        return content.encode('utf-8')
    
    
    def study_fast_iter():
        def show(elem):
            # Serialize as text, without the whitespace that follows the element
            return ET.tostring(elem, encoding='unicode', with_tail=False)

        def orig_fast_iter(context, func, *args, **kwargs):
            for event, elem in context:
                print('Processing {e}'.format(e=show(elem)))
                func(elem, *args, **kwargs)
                print('Clearing {e}'.format(e=show(elem)))
                elem.clear()
                # Only the preceding siblings of elem itself are removed
                while elem.getprevious() is not None:
                    print('Deleting {p}'.format(
                        p=(elem.getparent()[0]).tag))
                    del elem.getparent()[0]
            del context

        def mod_fast_iter(context, func, *args, **kwargs):
            """
            Based on Liza Daly's fast_iter
            http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
            See also http://effbot.org/zone/element-iterparse.htm
            """
            for event, elem in context:
                print('Processing {e}'.format(e=show(elem)))
                func(elem, *args, **kwargs)
                # It's safe to call clear() here because no descendants will be
                # accessed
                print('Clearing {e}'.format(e=show(elem)))
                elem.clear()
                # Also eliminate now-empty references from the root node to elem
                for ancestor in elem.xpath('ancestor-or-self::*'):
                    print('Checking ancestor: {a}'.format(a=ancestor.tag))
                    while ancestor.getprevious() is not None:
                        print(
                            'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                        del ancestor.getparent()[0]
            del context
    
        content = setup_ABC()
        context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
        orig_fast_iter(context, lambda elem: None)
        # Processing <C>1<D1/></C>
        # Clearing <C>1<D1/></C>
        # Deleting B1
        # Processing <C>2<D/></C>
        # Clearing <C>2<D/></C>
        # Deleting B2
    
        print('-' * 80)
        """
        The improved fast_iter deletes A1. The original fast_iter does not.
        """
        content = setup_ABC()
        context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
        mod_fast_iter(context, lambda elem: None)
        # Processing <C>1<D1/></C>
        # Clearing <C>1<D1/></C>
        # Checking ancestor: root
        # Checking ancestor: A1
        # Checking ancestor: C
        # Deleting B1
        # Processing <C>2<D/></C>
        # Clearing <C>2<D/></C>
        # Checking ancestor: root
        # Checking ancestor: A2
        # Deleting A1
        # Checking ancestor: C
        # Deleting B2
    
    study_fast_iter()
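The same difference can also be checked without reading the print trace. The sketch below is a condensed variant of the script above; the helper names `orig_cleanup`, `mod_cleanup` and `surviving_children` are made up for this example. It lists which children of the root element remain once parsing has finished, using the `root` property that lxml's iterparse object exposes after iteration.

```python
import io

import lxml.etree as ET

# Same structure as the A/B/C document above, written without whitespace.
XML = (b"<root>"
       b"<A1><B1/><C>1<D1/></C><E1/></A1>"
       b"<A2><B2/><C>2<D/></C><E2/></A2>"
       b"</root>")


def orig_cleanup(elem):
    """Original strategy: remove only the preceding siblings of elem."""
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]


def mod_cleanup(elem):
    """Modified strategy: also remove preceding siblings of each ancestor."""
    elem.clear()
    for ancestor in elem.xpath('ancestor-or-self::*'):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]


def surviving_children(cleanup):
    """Parse XML, apply cleanup to every <C>, return the root's child tags."""
    context = ET.iterparse(io.BytesIO(XML), events=('end',), tag='C')
    for event, elem in context:
        cleanup(elem)
    return [child.tag for child in context.root]


print(surviving_children(orig_cleanup))  # ['A1', 'A2']
print(surviving_children(mod_cleanup))   # ['A2']
```

With the original strategy, every already-processed top-level element stays attached to the root, so memory still grows with the number of such elements. The modified strategy keeps only the one currently being parsed.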