Search code examples
pythonxmlraspberry-pi2iterparse

iterparse large XML using python


This has been driving me nuts all day and i would appreciate a bit of help with parsing a large XML file ...

files contains over 900,000 lines and is downloaded in gzip format, i did have something working using an extract of the data for testing and parsing it with minidom, but thats just not going to cut it for the full file, so I'm looking at iterparse, but i just can't get any of the examples to work, even to the point where I'm getting unable to import errors .... the only import i can get to work is import xml.eTree.cElementTree but that barely seems to work with most of the code examples i have found

i did have one thing getting close with iterparse and cElementTree

def buildit(file):
        print file
        #with open(file) as line:
        #print line
        for  event, elem in et.iterparse(file):
                with open(file, "r") as line:
                        for event, elem in et.iterparse(file):
                                print elem.tag
                                if event =='end' and elem.tag=='Journey':
                                        print elem.tag
                                        time.sleep(0.5)
                                        elm.clear

but this prints out the following

{http://www.website.com/ixid/xmlfile/v8}Journey
{http://www.website.com/ixid/xmlfile/v8}OR
{http://www.website.com/ixid/xmlfile/v8}PP
{http://www.website.com/ixid/xmlfile/v8}IP
{http://www.website.com/ixid/xmlfile/v8}PP
{http://www.website.com/ixid/xmlfile/v8}IP

notice how its putting something form the top element into each item ??? anyway ... sample xml below ... and thats in advance for any help

<?xml version="1.0" encoding="utf-8"?>
<PportTimetable xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" timetableID="20160421020832" xmlns="http://www.website.com/ixid/xmlfile/v8">
  <Journey rid="201604211191598" uid="G61365" trainId="1T02" ssd="2016-04-21" toc="SR" trainCat="XX">
    <OR tpl="PERTH" act="TBK " plat="3" ptd="05:18" wtd="05:18" />
    <PP tpl="HILTONJ" wtp="05:22" />
    <IP tpl="GLNEGLS" act="T " plat="1" pta="05:33" ptd="05:33" wta="05:32:30" wtd="05:33:30" />
    <PP tpl="BLFD" wtp="05:37:30" />
    <IP tpl="DUNANE" act="T " plat="1" pta="05:45" ptd="05:46" wta="05:45" wtd="05:46" />
    <IP tpl="BGOALAN" act="T " plat="1" pta="05:49" ptd="05:49" wta="05:49" wtd="05:49:30" />
    <IP tpl="STIRLNG" act="T K " plat="3" pta="05:53" ptd="05:54" wta="05:53" wtd="05:54" />
    <IP tpl="LARBERT" act="T " plat="1" pta="06:03" ptd="06:03" wta="06:02:30" wtd="06:03" />
    <PP tpl="LARBERJ" wtp="06:04:30" />
    <PP tpl="CRMRSWJ" wtp="06:05" />
    <PP tpl="GNHLLJN" wtp="06:09" />
    <OPIP tpl="CMBRNLD" act="C N " plat="1" wta="06:22" wtd="06:24" />
    <PP tpl="GRNQNNJ" wtp="06:30" />
    <PP tpl="GSHRSJN" wtp="06:33" />
    <PP tpl="COATBDC" wtp="06:36:30" />
    <PP tpl="LGLNJN" wtp="06:38" />
    <PP tpl="CARMYLE" plat="1" wtp="06:49" />
    <PP tpl="RTHGNEJ" wtp="06:53:30" />
    <PP tpl="SHFD" wtp="06:56" />
    <PP tpl="LRKFLDJ" wtp="06:59" />
    <PP tpl="EGLNSTJ" wtp="07:01:30" />
    <PP tpl="GLGCBSJ" wtp="07:02:30" />
    <DT tpl="GLGC" act="TF" pta="07:05" wta="07:05" />
  </Journey>
  <Journey rid="201604211192476" uid="G64015" trainId="2N41" ssd="2016-04-21" toc="SR">
    <OR tpl="GLGQLL" act="TB" plat="8" ptd="06:20" wtd="06:20" />
    <PP tpl="FNSTNEJ" wtp="06:23:30" />
    <PP tpl="HYNDLEJ" wtp="06:28:30" />
    <OPIP tpl="ANSL" act="A N " plat="2" wta="06:30" wtd="06:30:30" />
    <PP tpl="MRYHILL" wtp="06:33" />
    <PP tpl="CWLRSNJ" wtp="06:48" />
    <PP tpl="CWLRSEJ" wtp="06:49" />
    <IP tpl="BSHB" act="T " plat="1" pta="06:52" ptd="06:54" wta="06:52" wtd="06:54" />
    <IP tpl="LENZIE" act="T " plat="1" pta="06:59" ptd="06:59" wta="06:58:30" wtd="06:59:30" />
    <IP tpl="CROY" act="T " plat="1" pta="07:06" ptd="07:06" wta="07:05:30" wtd="07:06:30" />
    <PP tpl="GNHLUJN" wtp="07:12:30" />
    <PP tpl="GNHLLJN" wtp="07:15" />
    <PP tpl="CRMRSWJ" wtp="07:17" />
    <PP tpl="LARBERJ" wtp="07:19:30" />
    <IP tpl="LARBERT" act="T " plat="2" pta="07:21" ptd="07:21" wta="07:20:30" wtd="07:21" />
    <IP tpl="STIRLNG" act="T " plat="6" pta="07:30" ptd="07:41" wta="07:29:30" wtd="07:41" />
    <IP tpl="BGOALAN" act="T " plat="2" pta="07:45" ptd="07:45" wta="07:45" wtd="07:45:30" />
    <DT tpl="DUNANE" act="TF" plat="DPV" pta="07:52" wta="07:52" />
  </Journey>
</PportTimetable>

Solution

  • Here is a working program that illustrates how to use .iterparse() from cElementTree, storing the results in a database. Note that this program is aware of the namespace used in the input XML.

    The i.xml is identical to the example XML given in the question.

    # Tested on Python 2.6.7, Ubuntu 14.04.4
    import xml.etree.cElementTree as et
    import sqlite3
    
    # Tools to deal with namespaces
    ixid_uri = 'http://www.website.com/ixid/xmlfile/v8'
    def extract_local_tag(qname):
        return qname.split('}')[-1]
    
    # A db connection to illustrate the example
    conn = sqlite3.connect(":memory:")
    c = conn.cursor()
    c.execute("create table foo (joury_uid text, tag text, tpl text)")
    conn.commit()
    
    # The main part of the code: iterate over the XML,
    # storing DB stuff at the end of every <Journey>
    with open('i.xml') as xml_file:
        for event, elem in et.iterparse(xml_file):
            # Must compare tag to qualified name
            if elem.tag == et.QName(ixid_uri, 'Journey'):
                c.executemany('insert into foo values(?, ?, ?)',
                    [
                        (elem.attrib['uid'],
                        extract_local_tag(child.tag),
                        child.attrib.get('tpl', None))
                        for child in elem
                    ])
                conn.commit()
                # Note: only clears <Journey> elements and their children.
                # There is a memory leak of any elements not children of <Journey>
                elem.clear()    
    for row in c.execute('select * from foo'):
        print row
    

    Result:

    (u'G61365', u'OR', u'PERTH')
    (u'G61365', u'PP', u'HILTONJ')
    ...
    (u'G61365', u'DT', u'GLGC')
    (u'G64015', u'OR', u'GLGQLL')
    (u'G64015', u'PP', u'FNSTNEJ')
    ...
    

    References: