I'm trying to read a bunch of xml files and do stuff to them. The first thing I want to do is rename them based on a number that's inside the file.
You can see a sample of the data hereWarning this will initiate a download of a 108MB zip file!. That's a huge xml file with thousands of smaller xml files inside it. I've broken those out into individual files. I want to rename the files based on a number inside (part of preprocessing). I have the following code:
from __future__ import print_function
from bs4 import BeautifulSoup # To get everything
import os
def rename_xml_files(directory):
xml_files = [xml_file for xml_file in os.listdir(directory) ]
for filename in xml_files:
filename = filename.strip()
full_filename = directory + "/" +filename
print (full_filename)
f = open(full_filename, "r")
xml = f.read()
soup = BeautifulSoup(xml)
del xml
del soup
f.close()
If I comment out the "soup =" and "del" lines, it works perfectly. If I add the "soup = ..." line, it will work for a moment and then it will eventually crap out - it just crashes the python kernel. I'm using Enthought Canopy, but I've tried it running from the command line and it craps out there, too.
I thought, perhaps, it was not deallocating the space for the variable "soup" so I tried adding the "del" commands. Same problem.
Any thoughts on how to circumvent this? I'm not stuck on BS. If there's a better way of doing this, I would love it, but I need a little sample code.
Try using cElementTree.parse()
from Python's standard xml
library instead of BeautifulSoup. 'Soup is great for parsing normal web pages, but cElementTree is blazing fast.
Like this:
import xml.etree.cElementTree as cET
# ...
def rename_xml_files(directory):
xml_files = [xml_file for xml_file in os.listdir(directory) ]
for filename in xml_files:
filename = filename.strip()
full_filename = directory + "/" +filename
print(full_filename)
parsed = cET.parse(full_filename)
del parsed
If your XML formatted correctly this should parse it. If your machine is still unable to handle all that data in memory, you should look into streaming the XML.