This has been driving me nuts all day and i would appreciate a bit of help with parsing a large XML file ...
files contains over 900,000 lines and is downloaded in gzip format, i did have something working using an extract of the data for testing and parsing it with minidom, but thats just not going to cut it for the full file, so I'm looking at iterparse, but i just can't get any of the examples to work, even to the point where I'm getting unable to import errors .... the only import i can get to work is import xml.eTree.cElementTree but that barely seems to work with most of the code examples i have found
i did have one thing getting close with iterparse and cElementTree
def buildit(file):
print file
#with open(file) as line:
#print line
for event, elem in et.iterparse(file):
with open(file, "r") as line:
for event, elem in et.iterparse(file):
print elem.tag
if event =='end' and elem.tag=='Journey':
print elem.tag
time.sleep(0.5)
elm.clear
but this prints out the following
{http://www.website.com/ixid/xmlfile/v8}Journey
{http://www.website.com/ixid/xmlfile/v8}OR
{http://www.website.com/ixid/xmlfile/v8}PP
{http://www.website.com/ixid/xmlfile/v8}IP
{http://www.website.com/ixid/xmlfile/v8}PP
{http://www.website.com/ixid/xmlfile/v8}IP
notice how its putting something form the top element into each item ??? anyway ... sample xml below ... and thats in advance for any help
<?xml version="1.0" encoding="utf-8"?>
<PportTimetable xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" timetableID="20160421020832" xmlns="http://www.website.com/ixid/xmlfile/v8">
<Journey rid="201604211191598" uid="G61365" trainId="1T02" ssd="2016-04-21" toc="SR" trainCat="XX">
<OR tpl="PERTH" act="TBK " plat="3" ptd="05:18" wtd="05:18" />
<PP tpl="HILTONJ" wtp="05:22" />
<IP tpl="GLNEGLS" act="T " plat="1" pta="05:33" ptd="05:33" wta="05:32:30" wtd="05:33:30" />
<PP tpl="BLFD" wtp="05:37:30" />
<IP tpl="DUNANE" act="T " plat="1" pta="05:45" ptd="05:46" wta="05:45" wtd="05:46" />
<IP tpl="BGOALAN" act="T " plat="1" pta="05:49" ptd="05:49" wta="05:49" wtd="05:49:30" />
<IP tpl="STIRLNG" act="T K " plat="3" pta="05:53" ptd="05:54" wta="05:53" wtd="05:54" />
<IP tpl="LARBERT" act="T " plat="1" pta="06:03" ptd="06:03" wta="06:02:30" wtd="06:03" />
<PP tpl="LARBERJ" wtp="06:04:30" />
<PP tpl="CRMRSWJ" wtp="06:05" />
<PP tpl="GNHLLJN" wtp="06:09" />
<OPIP tpl="CMBRNLD" act="C N " plat="1" wta="06:22" wtd="06:24" />
<PP tpl="GRNQNNJ" wtp="06:30" />
<PP tpl="GSHRSJN" wtp="06:33" />
<PP tpl="COATBDC" wtp="06:36:30" />
<PP tpl="LGLNJN" wtp="06:38" />
<PP tpl="CARMYLE" plat="1" wtp="06:49" />
<PP tpl="RTHGNEJ" wtp="06:53:30" />
<PP tpl="SHFD" wtp="06:56" />
<PP tpl="LRKFLDJ" wtp="06:59" />
<PP tpl="EGLNSTJ" wtp="07:01:30" />
<PP tpl="GLGCBSJ" wtp="07:02:30" />
<DT tpl="GLGC" act="TF" pta="07:05" wta="07:05" />
</Journey>
<Journey rid="201604211192476" uid="G64015" trainId="2N41" ssd="2016-04-21" toc="SR">
<OR tpl="GLGQLL" act="TB" plat="8" ptd="06:20" wtd="06:20" />
<PP tpl="FNSTNEJ" wtp="06:23:30" />
<PP tpl="HYNDLEJ" wtp="06:28:30" />
<OPIP tpl="ANSL" act="A N " plat="2" wta="06:30" wtd="06:30:30" />
<PP tpl="MRYHILL" wtp="06:33" />
<PP tpl="CWLRSNJ" wtp="06:48" />
<PP tpl="CWLRSEJ" wtp="06:49" />
<IP tpl="BSHB" act="T " plat="1" pta="06:52" ptd="06:54" wta="06:52" wtd="06:54" />
<IP tpl="LENZIE" act="T " plat="1" pta="06:59" ptd="06:59" wta="06:58:30" wtd="06:59:30" />
<IP tpl="CROY" act="T " plat="1" pta="07:06" ptd="07:06" wta="07:05:30" wtd="07:06:30" />
<PP tpl="GNHLUJN" wtp="07:12:30" />
<PP tpl="GNHLLJN" wtp="07:15" />
<PP tpl="CRMRSWJ" wtp="07:17" />
<PP tpl="LARBERJ" wtp="07:19:30" />
<IP tpl="LARBERT" act="T " plat="2" pta="07:21" ptd="07:21" wta="07:20:30" wtd="07:21" />
<IP tpl="STIRLNG" act="T " plat="6" pta="07:30" ptd="07:41" wta="07:29:30" wtd="07:41" />
<IP tpl="BGOALAN" act="T " plat="2" pta="07:45" ptd="07:45" wta="07:45" wtd="07:45:30" />
<DT tpl="DUNANE" act="TF" plat="DPV" pta="07:52" wta="07:52" />
</Journey>
</PportTimetable>
Here is a working program that illustrates how to use .iterparse()
from cElementTree
, storing the results in a database. Note that this program is aware of the namespace used in the input XML.
The i.xml
is identical to the example XML given in the question.
# Tested on Python 2.6.7, Ubuntu 14.04.4
import xml.etree.cElementTree as et
import sqlite3
# Tools to deal with namespaces
ixid_uri = 'http://www.website.com/ixid/xmlfile/v8'
def extract_local_tag(qname):
return qname.split('}')[-1]
# A db connection to illustrate the example
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("create table foo (joury_uid text, tag text, tpl text)")
conn.commit()
# The main part of the code: iterate over the XML,
# storing DB stuff at the end of every <Journey>
with open('i.xml') as xml_file:
for event, elem in et.iterparse(xml_file):
# Must compare tag to qualified name
if elem.tag == et.QName(ixid_uri, 'Journey'):
c.executemany('insert into foo values(?, ?, ?)',
[
(elem.attrib['uid'],
extract_local_tag(child.tag),
child.attrib.get('tpl', None))
for child in elem
])
conn.commit()
# Note: only clears <Journey> elements and their children.
# There is a memory leak of any elements not children of <Journey>
elem.clear()
for row in c.execute('select * from foo'):
print row
Result:
(u'G61365', u'OR', u'PERTH')
(u'G61365', u'PP', u'HILTONJ')
...
(u'G61365', u'DT', u'GLGC')
(u'G64015', u'OR', u'GLGQLL')
(u'G64015', u'PP', u'FNSTNEJ')
...
References: