Reading the large StackOverflow XML dump file (Posts.xml, ~90 GB) with the following approach
from xml.etree.cElementTree import iterparse
for evt, elem in iterparse("Posts.xml", events=('end',)):
    if elem.tag == 'row':
        user_fields = elem.attrib
causes an out-of-memory (OOM) error just from iterating over the XML elements (without allocating anything myself), even on a machine with 128 GB of RAM.
Since I could not find anything about this in the documentation or in other StackOverflow questions, could you help me figure out how to work around it?
Based on Daniel Haley's comments, you could try:
from lxml.etree import iterparse  # use lxml instead of xml.etree
for evt, elem in iterparse("Posts.xml", events=('end',), tag="row"):
    user_fields = elem.attrib
    ...
    elem.clear()  # free the element's contents once it has been processed
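If installing lxml is not an option, the same idea can be sketched with the standard library alone. The reason the original loop runs out of memory is that iterparse still builds the full tree in the background, so every parsed row stays attached to the root unless it is removed explicitly. The sketch below is only an illustration under a few assumptions: the helper name iter_rows is made up for this example, and it assumes every row element is a direct child of the root, as in the Posts.xml dump.

```python
import xml.etree.ElementTree as ET


def iter_rows(path):
    """Yield the attributes of each <row> element as a plain dict.

    Hypothetical helper for illustration; assumes <row> elements are
    direct children of the document root.
    """
    context = ET.iterparse(path, events=("start", "end"))
    # The first event is the 'start' of the root element; keep a handle on it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            # Copy the attributes before the element is discarded.
            yield dict(elem.attrib)
            # Detach all processed children so the tree does not keep growing.
            root.clear()
```

It would be used as `for attrs in iter_rows("Posts.xml"): ...`. Clearing the root, rather than only the current element, also drops the emptied elements themselves, which otherwise pile up under the root across tens of millions of rows.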