python, xml, parsing, minidom

Using minidom to break down an XML file into manageable files (Python)


I am trying to use Python to create three files from one XML file.

The XML file contains three types of data: estates, symbol names and tick types.

I want three text files, one listing each type.

This is currently my code, and it lists the estates absolutely fine:

    from xml.dom import minidom

    # Define the xmldoc object
    xmldoc = minidom.parse('C:\\Temp\\Symbols.xml')

    # Define EstateList by getting elements by tag name
    EstateList = xmldoc.getElementsByTagName('Estate')
    # Print estate list
    print("There are currently %d data estates" % len(EstateList))
    # print(EstateList[0].attributes['EstateName'].value)
    for s in EstateList:
        print(s.attributes['EstateName'].value)

    # Save estate list to file
    with open('dataestates.txt', 'w') as f:
        f.write("There are currently %d data estates \n" % len(EstateList))
        for s in EstateList:
            f.write(s.attributes['EstateName'].value + "\n")

However, I can't get anything to work for the other two, symbol names and tick types. I can't get close to listing the tick types; I've tried attributes, tags, all sorts.

Here is an excerpt of the XML:

    <Estates>
      <Estate EstateName="BBG.DL.BOND.RAW._LIVE">
        <Ticktype>BBG_BGN</Ticktype>
        <Ticktype>BBG_BVAL</Ticktype>
        <Ticktype>BBG_CBBT</Ticktype>
        <Ticktype>BBG_IXEP</Ticktype>
        <Ticktype>BBG_IXSP</Ticktype>
        <Ticktype>BBG_TRAC</Ticktype>
        <Ticktype>BBG</Ticktype>
      </Estate>
      <Estate EstateName="BBG.DL.CCY.RAW._LIVE">
        <Ticktype>BBG</Ticktype>
      </Estate>
    </Estates>
    <Symbols>
      <Symbol SymbolName="AT0000386073 Corp" Estate="BBG.DL.BOND.RAW._LIVE" TickType="BBG_BGN" />
      <Symbol SymbolName="AT0000386073 Corp" Estate="BBG.DL.BOND.RAW._LIVE" TickType="BBG_BVAL" />
    </Symbols>

Solution

    1. Ticks

    The text inside a <Ticktype> element is not stored on the element itself; it is stored in a child Text node. Node.firstChild returns that child, and the Text node's data attribute holds the string.

    Thus, given a <Ticktype> element s, the text is s.firstChild.data.

    ticklist = xmldoc.getElementsByTagName('Ticktype')
    print("There are currently %d tick types" % len(ticklist))
    for s in ticklist:
        print(s.firstChild.data)
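
To try the firstChild.data lookup without the original file, here is a minimal sketch in Python 3 that parses a shortened copy of the question's XML from a string. The <Root> wrapper and the helper name tick_types are assumptions made for illustration: the excerpt shows two top-level elements, and a well-formed document needs exactly one. The guard for a missing child is also an addition, since an empty <Ticktype/> element would have no firstChild.

```python
from xml.dom import minidom

# Shortened sample from the question. The <Root> wrapper is an assumption,
# added because a well-formed XML document needs a single top-level element.
SAMPLE = """<Root>
  <Estates>
    <Estate EstateName="BBG.DL.BOND.RAW._LIVE">
      <Ticktype>BBG_BGN</Ticktype>
      <Ticktype>BBG_BVAL</Ticktype>
      <Ticktype>BBG</Ticktype>
    </Estate>
    <Estate EstateName="BBG.DL.CCY.RAW._LIVE">
      <Ticktype>BBG</Ticktype>
    </Estate>
  </Estates>
</Root>"""


def tick_types(doc):
    """Return the text of every <Ticktype> element, in document order."""
    result = []
    for node in doc.getElementsByTagName("Ticktype"):
        child = node.firstChild
        # An empty element such as <Ticktype/> has no child node at all.
        if child is not None and child.nodeType == child.TEXT_NODE:
            result.append(child.data.strip())
    return result


doc = minidom.parseString(SAMPLE)
ticks = tick_types(doc)
print("There are currently %d tick types" % len(ticks))
for t in ticks:
    print(t)
```

With the real file, minidom.parse('C:\\Temp\\Symbols.xml') would replace the parseString call.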
    

    The sample contains duplicate tick types (BBG appears under both estates). A set reduces them to unique values:

    tickset = set(s.firstChild.data for s in ticklist)
    print("There are %d unique tick types" % len(tickset))
    for s in tickset:
        print(s)
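
A set does not keep the order in which values were seen, so the output file may list tick types in an arbitrary order. If a stable order matters, two common alternatives are sketched below in Python 3; the list of values is hard-coded here purely for illustration.

```python
# Tick types in document order, as they might come out of the XML.
ticks = ["BBG_BGN", "BBG_BVAL", "BBG", "BBG"]

# Keep the first occurrence of each value, preserving document order.
# dict keys are insertion-ordered in Python 3.7 and later.
unique_in_order = list(dict.fromkeys(ticks))

# Or sort the unique values alphabetically.
unique_sorted = sorted(set(ticks))

print(unique_in_order)
print(unique_sorted)
```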
    

    2. Symbols

    Symbol names are stored the same way as estate names, as an attribute on the element. Thus, they are extracted the same way:

    symlist = xmldoc.getElementsByTagName('Symbol')
    print("There are currently %d symbols" % len(symlist))
    for s in symlist:
        print(s.attributes['SymbolName'].value)
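
Putting the pieces together, the sketch below (Python 3) writes the three text files the question asks for. It is one possible arrangement, not the only one: the <Root> wrapper, the helper names write_list and export_all, and the file names ticktypes.txt and symbols.txt are assumptions (only dataestates.txt comes from the question). It writes into a temporary directory so it can be run safely as-is.

```python
import os
import tempfile
from xml.dom import minidom

# Shortened sample from the question, wrapped in an assumed <Root> element.
SAMPLE = """<Root>
  <Estates>
    <Estate EstateName="BBG.DL.BOND.RAW._LIVE">
      <Ticktype>BBG_BGN</Ticktype>
      <Ticktype>BBG_BVAL</Ticktype>
    </Estate>
    <Estate EstateName="BBG.DL.CCY.RAW._LIVE">
      <Ticktype>BBG</Ticktype>
    </Estate>
  </Estates>
  <Symbols>
    <Symbol SymbolName="AT0000386073 Corp" Estate="BBG.DL.BOND.RAW._LIVE" TickType="BBG_BGN" />
    <Symbol SymbolName="AT0000386073 Corp" Estate="BBG.DL.BOND.RAW._LIVE" TickType="BBG_BVAL" />
  </Symbols>
</Root>"""


def write_list(path, label, values):
    """Write a count line followed by one value per line."""
    with open(path, "w") as f:
        f.write("There are currently %d %s\n" % (len(values), label))
        for value in values:
            f.write(value + "\n")


def export_all(doc, out_dir):
    """Extract estates, tick types and symbols, and write one file for each."""
    estates = [e.getAttribute("EstateName")
               for e in doc.getElementsByTagName("Estate")]
    # Tick types live in a child text node; skip empty elements.
    ticks = sorted({t.firstChild.data
                    for t in doc.getElementsByTagName("Ticktype")
                    if t.firstChild is not None})
    symbols = [s.getAttribute("SymbolName")
               for s in doc.getElementsByTagName("Symbol")]

    write_list(os.path.join(out_dir, "dataestates.txt"), "data estates", estates)
    write_list(os.path.join(out_dir, "ticktypes.txt"), "unique tick types", ticks)
    write_list(os.path.join(out_dir, "symbols.txt"), "symbols", symbols)
    return estates, ticks, symbols


out_dir = tempfile.mkdtemp()
estates, ticks, symbols = export_all(minidom.parseString(SAMPLE), out_dir)
print("Files written to", out_dir)
```

In the sample, the same SymbolName appears once per tick type, so the symbols file contains repeats. Each <Symbol> also carries Estate and TickType attributes, which getAttribute can read in the same way if the file should list them too.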