Search code examples
pythonxmlparsingsaxelementtree

memory efficient way to change and parse a large XML file in python


I want to parse a large XML file (25 GB) in python, and change some of its elements.

I tried ElementTree from xml.etree but it takes too much time at the first step (ElementTree.parse).

I read somewhere that SAX is fast and do not load the entire file into the memory but it just for parsing not modifying.

'iterparse' should also be just for parsing not modifying.

Is there any other option which is fast and memory efficient?


Solution

  • What is important for you here is that you need a streaming parser, which is what sax is. (There is a built in sax implementation in python and lxml provides one.) The problem is that since you are trying to modify the xml file, you will have to rewrite the xml file as you read it.

    An XML file is a text file, You can't go and change some data in the middle of the text file without rewriting the entire text file (unless the data is the exact same size which is unlikely)

    You can use SAX to read in each element and register an event to write back each element after it is been read and modified. If your changes are really simple it may be even faster to not even bother with the XML parsing and just match text for what you are looking for.

    If you are doing any signinficant work with this large of an XML file, then I would say you shouldn't be using an XML file, you should be using a database.

    The problem you have run into here is the same issue that Cobol programmers on mainframes had when they were working with File based data