Tags: python, xml, memory-management, xml-parsing, iterparse

How to iteratively parse a large XML file in Python?


I need to process an XML file of approximately 8 GB. Its structure is similar to the (simplified) example below:

<TopLevelElement>
    <SomeElementList>
        <Element>zzz</Element>
        ....and so on for thousands of rows
    </SomeElementList>
    <Records>
        <RecordType1>
            <RecordItem id="aaaa">
                <SomeData>
                    <SomeMoreData NameType="xxx">
                        <NameComponent1>zzz</NameComponent1>
                        ....
                        <AnotherNameComponent>zzzz</AnotherNameComponent>
                    </SomeMoreData>
                </SomeData>
            </RecordItem>
        ..... hundreds of thousands of items, some are quite large.
        </RecordType1>
        <RecordType2>
            <RecordItem id="cccc">
            ...hundreds of thousands of RecordType2 elements, slightly different from RecordItems in RecordType1 
            </RecordItem>
        </RecordType2>
    </Records>
</TopLevelElement>

I need to extract some of the sub-elements of the RecordType1 and RecordType2 elements. There are conditions that determine which record items need to be processed and which fields need to be extracted. The individual RecordItems do not exceed 120 KB (some contain extensive text data, which I do not need).

Here is the code. The function get_all_records receives the following inputs: a) the path to the XML file; b) the record category ('RecordType1' or 'RecordType2'); c) which name types and name components to pick.

from xml.etree import cElementTree as ET

def get_all_records(xml_file_path, record_category, name_types, name_components):
    context = ET.iterparse(xml_file_path, events=("start", "end"))
    context = iter(context)
    event, root = next(context)  # grab the root element so it can be cleared later
    all_records = []
    for event, elem in context:
        if event == 'end' and elem.tag == record_category and elem.attrib['action'] != 'del':
            record_contents = get_record(elem, name_types=name_types, name_components=name_components, record_id=elem.attrib['id'])
            if record_contents:
                all_records += record_contents
            root.clear()  # intended to release memory for processed elements
    return all_records

I have experimented with the number of records: the code processes 100k RecordItems (only Type1; it just takes too long to get to Type2) in approximately one minute. Attempting to process a larger number of records (I tried one million) eventually leads to a MemoryError in ElementTree.py. So I am guessing no memory is released despite the root.clear() statement.

An ideal solution would be one where the RecordItems are read one at a time, processed, and then discarded from memory, but I have no clue how to do that. From an XML point of view, the two extra layers of elements (TopLevelElement and Records) seem to complicate the task. I am new to XML and to the respective Python libraries, so a detailed explanation would be much appreciated!


Solution

  • Iterating over a huge XML file is always painful.

    I'll go over the whole process from start to finish, suggesting best practices for keeping memory usage low while maximizing parsing speed.

    First, there is no need to store ET.iterparse in a variable and pull elements with next(). Just iterate over it directly:

    for event, elem in ET.iterparse(xml_file, events=("start", "end")):

    This iterator was created for, well..., iteration: it streams the file and hands you one element at a time. One caveat, though: iterparse still builds the tree incrementally behind the scenes, so you should release each record yourself once you are done with it (the elem.clear() call below replaces the root.clear() dance). With that in place, you can process XML files as large as your hard disk allows.

    Your code should then look like this:

    from xml.etree import ElementTree as ET  # cElementTree is deprecated and was removed in Python 3.9
    
    def get_all_records(xml_file_path, record_category, name_types, name_components):
        all_records = []
        for event, elem in ET.iterparse(xml_file_path, events=("start", "end")):
            if event == 'end' and elem.tag == record_category:
                if elem.attrib['action'] != 'del':
                    record_contents = get_record(elem, name_types=name_types, name_components=name_components, record_id=elem.attrib['id'])
                    if record_contents:
                        all_records += record_contents
                elem.clear()  # release the finished record either way, so memory stays flat
        return all_records
    

    Also, please think carefully about why you need to store the whole all_records list. If it is only so you can write a CSV file at the end of the process, that reason isn't good enough, and it will cause memory issues again when you scale to even bigger XML files.

    Make sure you write each new row to the CSV as soon as that row is produced, turning the memory issue into a non-issue; see the sketch below.
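
    For example, here is a minimal sketch of that streaming pattern. The function name stream_records_to_csv and the per-record row layout (the id attribute plus the text of each direct child) are illustrative assumptions, not part of the original code:

    import csv
    from xml.etree import ElementTree as ET
    
    def stream_records_to_csv(xml_file_path, record_category, csv_path):
        # Hypothetical sketch: write one CSV row per matching record instead
        # of accumulating a list in memory. The row layout is just an example.
        with open(csv_path, "w", newline="") as f:
            writer = csv.writer(f)
            # Only 'end' events are needed: the element is fully parsed by then.
            for event, elem in ET.iterparse(xml_file_path, events=("end",)):
                if elem.tag == record_category and elem.attrib.get("action") != "del":
                    row = [elem.attrib.get("id")] + [child.text for child in elem]
                    writer.writerow(row)  # the row goes straight to disk
                    elem.clear()          # drop the parsed subtree from memory

    With this shape, peak memory is bounded by the size of a single record rather than the size of the whole file.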

    P.S.

    If you need to remember a few tags that appear before your main tag, so you can use that earlier information as you move down the XML file, just store them locally in some new variables. This comes in handy whenever data later in the file refers back to a specific tag you know has already occurred; see the sketch below.
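
    As an illustration, here is a minimal sketch against the sample structure from the question. It uses 'start' events to remember which RecordType wrapper the parser is currently inside, so each RecordItem can be yielded together with that context; the function name iter_records_with_context is an assumption:

    from xml.etree import ElementTree as ET
    
    def iter_records_with_context(xml_file_path):
        # Hypothetical sketch: remember an enclosing tag seen earlier in the
        # stream and attach it to each record as context.
        current_section = None  # e.g. 'RecordType1' or 'RecordType2'
        for event, elem in ET.iterparse(xml_file_path, events=("start", "end")):
            if event == "start" and elem.tag in ("RecordType1", "RecordType2"):
                current_section = elem.tag  # the wrapper we are currently inside
            elif event == "end" and elem.tag == "RecordItem":
                yield current_section, elem.attrib.get("id")
                elem.clear()  # free the finished item

    A caller then consumes it as a plain generator, e.g. for section, rec_id in iter_records_with_context("big.xml"): ...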