Language :- Python 2.7.6
File Size :- 1.5 GB
XML Format
<myfeed>
<product>
<id>876543</id>
<name>ABC</name>
....
</product>
<product>
<id>876567</id>
<name>DEF</name>
....
</product>
<product>
<id>986543</id>
<name>XYZ</name>
....
</product>
I have to
A) Read all the <product> nodes
B) Delete some of these nodes (if the text of the node's <id> child element is in a Python set())
C) Update/alter a few nodes (if the text of the node's <id> child element is in a Python dict)
D) Append/write some new nodes
The problem is that my XML file is huge (approx. 1.5 GB). I did some research and decided to use lxml for all of these purposes.
I am trying to use iterparse() with element.clear() to achieve this, because it will not consume all my memory.
    from lxml import etree

    for event, element in etree.iterparse(big_xml_file, tag='product'):
        for child in element:
            if child.tag == unique_tag:
                if child.text in products_id_hash_set_to_delete:  # Python set()
                    pass  # delete this element node
                elif child.text in products_dict_to_update:
                    pass  # update this element node
                else:
                    print(child.text)
        element.clear()
Note: I want to achieve all four of these tasks in one scan of the XML file.
Questions
1) Can I achieve all this in one scan of the file?
2) If yes, how do I delete and update the element nodes I am processing?
3) Should I use tree.xpath() instead? If yes, how much memory will it consume for a 1.5 GB file, or does it work the same way as iterparse()?
I am not very experienced in Python. I come from a Java background.
You can't edit an XML file in-place. You have to write the output to a new (temporary) file, and then replace the original file with the new file.
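So the basic algorithm is:
1) Open the input file for reading and a new (temporary) output file for writing.
2) For each <product> node read from the input: if its <id> is in the delete set, skip it; if its <id> is in the update dict, modify the node and write it to the output; otherwise write it to the output unchanged.
3) Write the new nodes to the output.
4) Close both files and replace the original file with the output file.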
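The write-to-a-new-file approach can be sketched roughly as below. This is only an illustration under some assumptions: it uses the standard-library xml.etree.ElementTree rather than lxml to stay dependency-free (lxml's iterparse could be substituted, though its memory-freeing idiom may differ slightly), the function name rewrite_feed and its parameters are hypothetical, and an "update" is assumed to mean replacing the text of <name>.

```python
import xml.etree.ElementTree as ET


def rewrite_feed(src_path, dst_path, ids_to_delete, updates, new_products):
    """Stream src_path into dst_path in a single pass.

    ids_to_delete: set of <id> texts whose <product> is dropped
    updates:       dict mapping <id> text -> new <name> text (assumed update rule)
    new_products:  iterable of (id, name) pairs appended at the end
    """
    with open(dst_path, "wb") as out:
        out.write(b"<myfeed>\n")
        context = ET.iterparse(src_path, events=("start", "end"))
        _, root = next(context)  # first event is the start of the root element
        for event, elem in context:
            if event != "end" or elem.tag != "product":
                continue
            product_id = elem.findtext("id")
            if product_id not in ids_to_delete:  # deleting = not writing it
                if product_id in updates:
                    elem.find("name").text = updates[product_id]
                elem.tail = "\n"
                out.write(ET.tostring(elem))
            root.clear()  # drop the processed <product> so memory stays flat
        for new_id, new_name in new_products:
            product = ET.Element("product")
            ET.SubElement(product, "id").text = new_id
            ET.SubElement(product, "name").text = new_name
            product.tail = "\n"
            out.write(ET.tostring(product))
        out.write(b"</myfeed>\n")
```

Once the function returns, the new file can be swapped in for the original, for example with os.rename(dst_path, src_path); on POSIX systems that replaces the existing file, while on Windows under Python 2 the original would have to be removed first.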
To answer the supplemental question: You need to realize that an XML file is a (long) string of characters. If you want to insert a character, you have to shuffle all the other ones up; if you want to delete a character, you have to shuffle all the other ones down. You can't do that with a file; you can't just delete a character from the middle of a file.
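To make the "shuffle everything down" point concrete, here is a small sketch of what removing bytes from the middle of a file actually involves. The helper name delete_range is hypothetical; the point is that every byte after the deleted span has to be copied to a lower offset before the file can be truncated, which is why it is no cheaper than writing a new file.

```python
def delete_range(path, start, length, chunk_size=64 * 1024):
    """Remove `length` bytes at offset `start` by shifting the rest of the file down."""
    with open(path, "r+b") as f:
        read_pos = start + length   # first byte that survives after the gap
        write_pos = start           # where that byte has to move to
        while True:
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            if not chunk:
                break
            f.seek(write_pos)
            f.write(chunk)
            read_pos += len(chunk)
            write_pos += len(chunk)
        f.truncate(write_pos)       # cut off the now-duplicated tail
```

For a 1.5 GB feed, deleting one <product> near the start this way rewrites almost the whole file, and doing it once per deleted node would be far slower than a single streaming copy.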
If you have millions of elements (and this is a real problem, not an exercise for a class), then you need to use a database. SQLite is my first thought when somebody says "database", but as Charles Duffy points out below, an XQuery database would probably be a better place to start given you already have XML. See BaseX or eXist for some open-source implementations.
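As a rough idea of what the database route looks like, here is a minimal sketch using Python's built-in sqlite3 module. The table layout (a product table with id and name columns) and the sample values are assumptions made up for illustration; a real feed would need one column per field, or an XQuery database as suggested above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # pass a file path instead for a persistent store
conn.execute("CREATE TABLE product (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [("876543", "ABC"), ("876567", "DEF"), ("986543", "XYZ")],
)

ids_to_delete = {"876567"}          # hypothetical delete set
updates = {"986543": "XYZ v2"}      # hypothetical id -> new name mapping

# Delete, update and insert are each a single statement per row; no file rewriting.
conn.executemany("DELETE FROM product WHERE id = ?", [(i,) for i in ids_to_delete])
conn.executemany(
    "UPDATE product SET name = ? WHERE id = ?",
    [(name, pid) for pid, name in updates.items()],
)
conn.execute("INSERT INTO product VALUES (?, ?)", ("111111", "NEW"))
conn.commit()

rows = conn.execute("SELECT id, name FROM product ORDER BY id").fetchall()
```

With the data in a table, each change touches only the affected rows, and the XML feed can be regenerated from a query whenever it is needed.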