Modify a large XML file using lxml

Language :- Python 2.7.6

File Size :- 1.5 GB

XML Format

<myfeed>
    <product>
        <id>876543</id>
        <name>ABC</name>
        ....
    </product>

    <product>
        <id>876567</id>
        <name>DEF</name>
        ....
    </product>

    <product>
        <id>986543</id>
        <name>XYZ</name>
        ....
    </product>
</myfeed>

I have to:

A) Read all the <product> nodes

B) Delete some of these nodes (if the text of the <id> child element is in a Python set())

C) Update/alter a few nodes (if the text of the <id> child element is in a Python dict)

D) Append/write some new nodes

The problem is that my XML file is huge (approx. 1.5 GB). I did some research and decided to use lxml for all of these tasks.

I am trying to use iterparse() with element.clear() to achieve this, because that should not consume all my memory.

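from lxml import etree

# big_xml_file, unique_tag ('id' here), products_id_hash_set_to_delete and
# products_dict_to_update are defined elsewhere
for event, element in etree.iterparse(big_xml_file, tag='product'):
    for child in element:
        if child.tag == unique_tag:
            if child.text in products_id_hash_set_to_delete:  # Python set()
                pass  # delete this element node (how?)
            elif child.text in products_dict_to_update:  # Python dict
                pass  # update this element node (how?)
            else:
                print(child.text)
    element.clear()  # free the children of the processed <product>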

Note:- I want to achieve all four of these tasks in a single scan of the XML file.

Questions

1) Can I achieve all this in one scan of the file?

2) If yes, how do I delete and update the element nodes I am processing?

3) Should I use tree.xpath() instead? If yes, how much memory will it consume for a 1.5 GB file, or does it work in the same way as iterparse()?

I am not very experienced in Python. I come from a Java background.

Solution

You can't edit an XML file in-place. You have to write the output to a new (temporary) file, and then replace the original file with the new file.
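
The write-then-replace step could look roughly like the sketch below. The helper name rewrite_via_temp_file and the transform(src, dst) callback are hypothetical, introduced only for illustration. The temporary file is created in the same directory as the original so that the final rename stays on one filesystem.

```python
import os
import tempfile

# os.replace exists only on Python 3.3+; on POSIX os.rename also overwrites
_replace = getattr(os, "replace", os.rename)


def rewrite_via_temp_file(path, transform):
    """Call transform(src, dst) on open binary files, then swap dst into place.

    Hypothetical helper: 'transform' reads the old file from src and writes
    the new content to dst.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            transform(src, dst)
        _replace(tmp_path, path)  # the original is replaced only on success
    except Exception:
        os.remove(tmp_path)       # on failure the original stays untouched
        raise
```

If the transformation fails halfway, the original file is left as it was, which is the main reason for not writing over it directly.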

So the basic algorithm is:

1) Loop over all elements:
   - If the node is one to delete, skip it and proceed to the next element.
   - If the node is one to change, change its value.
   - Write out the node. ««« This is the crucial bit you are missing
   - When you are about to finish processing a node which is the parent of one of the new nodes, write out the new node and remove it from the collection of new nodes.
2) Close the output file.
3) Rename the output file over the original.

To answer the supplemental question: You need to realize that an XML file is a (long) string of characters. If you want to insert a character, you have to shuffle all the other ones up; if you want to delete a character, you have to shuffle all the other ones down. You can't do that with a file; you can't just delete a character from the middle of a file.
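
A small demonstration of this point, using a throwaway temporary file: writing in the middle of a file overwrites the bytes that are already there rather than inserting new ones, so nothing after the write position moves.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"<id>876543</id>")

with open(path, "r+b") as f:
    f.seek(4)        # position just after "<id>"
    f.write(b"99")   # overwrites "87"; the rest of the file is not shifted

with open(path, "rb") as f:
    content = f.read()
os.remove(path)

# content is now b"<id>996543</id>": same length, two bytes replaced
```

Making the file longer or shorter at that position would mean rewriting everything that follows it, which is what writing to a new file does anyway.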

If you have millions of elements (and this is a real problem, not an exercise for a class), then you need to use a database. SQLite is my first thought when somebody says "database", but as Charles Duffy points out below, an XQuery database would probably be a better place to start given you already have XML. See BaseX or eXist for some open-source implementations.
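
As a rough illustration of the database route with SQLite, the delete, update, and append steps become three statements. The table layout (a product table with id and name columns) and the sample values are assumptions made for this sketch; a real feed would need a column for each field and a one-off import of the XML.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real feed would use a file path
conn.execute("CREATE TABLE product (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [("876543", "ABC"), ("876567", "DEF"), ("986543", "XYZ")],
)

# Hypothetical inputs, shaped like the set and dict from the question
ids_to_delete = {"876543"}
updates = {"876567": "DEF2"}
new_products = [("111", "NEW")]

conn.executemany("DELETE FROM product WHERE id = ?",
                 [(i,) for i in ids_to_delete])
conn.executemany("UPDATE product SET name = ? WHERE id = ?",
                 [(name, i) for i, name in updates.items()])
conn.executemany("INSERT INTO product VALUES (?, ?)", new_products)
conn.commit()

rows = conn.execute("SELECT id, name FROM product ORDER BY id").fetchall()
```

Here each change touches only the affected rows, instead of rewriting the whole 1.5 GB file for every batch of edits.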