OOM when using iterparse on huge XML dump file

Reading the large StackOverflow XML dump file (Posts.xml, ~90 GB) with the following approach

from xml.etree.cElementTree import iterparse

for evt, elem in iterparse("Posts.xml", events=('end',)):
    if elem.tag == 'row':
        user_fields = elem.attrib

causes an out-of-memory (OOM) error just from iterating over the XML elements (without storing anything myself), even on a machine with 128 GB of RAM.

I could not find anything about this in the documentation or in other examples from the StackOverflow community. Could you help me figure out how to work around it?

Solution

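iterparse still builds the tree in memory as it parses, so every element it has seen stays attached to the root unless you clear it yourself. Based on Daniel Haley's comments, you could try: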

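from lxml.etree import iterparse  # use lxml instead of xml.etree

for evt, elem in iterparse("Posts.xml", events=('end',), tag="row"):
    user_fields = elem.attrib
    ...
    # Free the element's contents once you are done with it
    elem.clear()
    # Also drop already-processed siblings that the root still references
    while elem.getprevious() is not None:
        del elem.getparent()[0]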
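
If installing lxml is not an option, the same clear-as-you-go idea can be sketched with the standard library alone (xml.etree.ElementTree, since xml.etree.cElementTree was removed in Python 3.9). This is only a minimal sketch, not something tested against the 90 GB dump: it assumes the file is a flat list of row elements under a single root, and the helper name iter_rows is made up for illustration.

```python
import io
from xml.etree.ElementTree import iterparse


def iter_rows(source, tag="row"):
    """Yield the attributes of each <tag> element without keeping the tree.

    `source` is a filename or a binary file object. `iter_rows` is a
    hypothetical helper name, not part of any library.
    """
    context = iterparse(source, events=("start", "end"))
    # The first event is the "start" of the root element; keep a handle on it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # Copy the attributes, because clearing the root discards them.
            yield dict(elem.attrib)
            # Detach all finished children from the root so they can be freed.
            root.clear()


# Small in-memory stand-in for Posts.xml (assumed layout: rows under one root).
sample = io.BytesIO(b'<posts><row Id="1" Score="5"/><row Id="2" Score="7"/></posts>')
rows = list(iter_rows(sample))
print(rows)
```

For the real file you would pass "Posts.xml" instead of the in-memory sample and process each row inside the loop rather than collecting them all in a list, since a list of every row would bring the memory problem back.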