ElementTree module to separate xml contents


I am trying to parse an XML file and arrange it into a table, separating the contents as isElement, isAttribute, Value, Text.

How do I use the ElementTree module to achieve this? I know this is possible using the minidom module.

The reason I want to use ElementTree is its efficiency. An example of what I am trying to achieve is available here: http://python.zirael.org/e-gtk-treeview4.html

Any advice on how to separate the XML contents into element, subelement, etc. using the ElementTree module?

This is what I have so far:

    import xml.etree.cElementTree as ET

    filetree = ET.ElementTree(file="some_file.xml")
    for child in filetree.iter():
        print child.tag, child.text, child.attrib

For the following example xml file:

    <?xml version="1.0"?>
    <data>
        <country name="Liechtenstein">
            <rank>1</rank>
            <year>2008</year>
            <gdppc>141100</gdppc>
            <neighbor name="Austria" direction="E"/>
            <neighbor name="Switzerland" direction="W"/>
        </country>
        <country name="Singapore">
            <rank>4</rank>
            <year>2011</year>
            <gdppc>59900</gdppc>
            <neighbor name="Malaysia" direction="N"/>
        </country>
        <country name="Panama">
            <rank>68</rank>
            <year>2011</year>
            <gdppc>13600</gdppc>
            <neighbor name="Costa Rica" direction="W"/>
            <neighbor name="Colombia" direction="E"/>
        </country>
    </data>

I get this as output:

    data 
         {}
    country 
             {'name': 'Liechtenstein'}
    rank 1 {}
    year 2008 {}
    gdppc 141100 {}
    neighbor None {'direction': 'E', 'name': 'Austria'}
    neighbor None {'direction': 'W', 'name': 'Switzerland'}
    country 
             {'name': 'Singapore'}
    rank 4 {}
    year 2011 {}
    gdppc 59900 {}
    neighbor None {'direction': 'N', 'name': 'Malaysia'}
    country 
             {'name': 'Panama'}
    rank 68 {}
    year 2011 {}
    gdppc 13600 {}
    neighbor None {'direction': 'W', 'name': 'Costa Rica'}
    neighbor None {'direction': 'E', 'name': 'Colombia'}

I did find something similar in another post, "Walk through all XML nodes in an element-nested structure", but it uses the DOM module.

Based on the comment received, this is what I want to achieve:

    data (type Element)
         country(Element)
              Text = None
              name(Attribute)
                 value: Liechtenstein
              rank(Element)
                  Text = 1
              year(Element)
                  Text = 2008
              gdppc(Element)
                  Text = 141100
              neighbor(Element)
                  name(Attribute)
                      value: Austria
                  direction(Attribute)
                      value: E
              neighbor(Element)
                  name(Attribute)
                      value: Switzerland
                  direction(Attribute)
                      value: W

         country(Element)
              Text = None
              name(Attribute)
                 value: Singapore
              rank(Element)
                  Text = 4

I want to be able to present my data in a tree-like structure as above. To do this I need to keep track of the relationships between the nodes. I hope this clarifies the question.


Solution

  • Element objects are sequences containing their direct child elements. XML attributes are stored in a dictionary (attrib) mapping attribute names to values. There are no text nodes as in DOM; text is stored in the text and tail attributes instead. Text inside an element but before its first subelement is stored in that element's text, and text that follows an element's end tag, up to the next tag, is stored in that element's tail. So if we take the gtk-treeview4-2.py example from "TreeView IV. - display of trees", we have to replace this DOM code:

    # ...
    import xml.dom.minidom as dom
    # ...
    
        def create_interior(self):
            # ...
            doc = dom.parse(self.filename)
            self.add_element_to_treestore(doc.childNodes[0], None)
            # ...
    
        def add_element_to_treestore(self, e, parent):
            if isinstance(e, dom.Element):
                me = self.model.append(parent, [e.nodeName, 'ELEMENT', ''])
                for i in range(e.attributes.length):
                    a = e.attributes.item(i)
                    self.model.append(me, ['@' + a.name, 'ATTRIBUTE', a.value])
                for ch in e.childNodes:
                    self.add_element_to_treestore(ch, me)
            elif isinstance(e, dom.Text):
                self.model.append(
                    parent, ['text()', 'TEXT_NODE', e.nodeValue.strip()])
    

    by the following using ElementTree:

    # ...
    from xml.etree import ElementTree as etree
    # ...
    
        def create_interior(self):
            # ...
            doc = etree.parse(self.filename)
            self.add_element_to_treestore(doc.getroot())
            # ...
    
        def add_element_to_treestore(self, element, parent=None):
            # One row per element; attributes, text and children go below it.
            path = self.model.append(parent, [element.tag, 'ELEMENT', ''])
            for name, value in sorted(element.attrib.items()):
                self.model.append(path, ['@' + name, 'ATTRIBUTE', value])
            if element.text:
                self.model.append(
                    path, ['text()', 'TEXT_NODE', element.text.strip()]
                )
            for child in element:
                self.add_element_to_treestore(child, path)
                # Text after a child's end tag is stored in that child's tail.
                if child.tail:
                    self.model.append(
                        path, ['text()', 'TEXT_NODE', child.tail.strip()]
                    )
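
To see how text and tail divide up mixed content, here is a minimal standalone sketch. The `<p>` snippet is invented for illustration and is not part of the question's data; only the standard-library `xml.etree.ElementTree` module is assumed.

```python
import xml.etree.ElementTree as ET

# Mixed content: text appears before, between, and after the child elements.
root = ET.fromstring("<p>before<b>bold</b>between<i>italic</i>after</p>")

# root.text is the text inside <p> up to its first child.
print(repr(root.text))

for child in root:
    # child.text sits inside the child; child.tail follows its end tag.
    print(child.tag, repr(child.text), repr(child.tail))

# Nothing follows the root element, so its tail is None.
print(repr(root.tail))
```

Note that "between" and "after" belong to the tail of `<b>` and `<i>` respectively, not to `<p>`, which is why a tree walker has to look at each child's tail.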
    

    Screenshot with your example data and the first subtree fully expanded:

    Screenshot of example data


    Update: Screenshot of example data and relevant import lines in code added.
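
If the GTK tree view is not needed, the same traversal can print an indented outline close to the one sketched in the question. This is only a rough sketch: the `describe` function name, the exact line layout, and the shortened `SAMPLE` document are assumptions made for illustration, and whitespace-only text is skipped rather than shown as `None`.

```python
import xml.etree.ElementTree as ET

def describe(element, depth=0):
    """Yield indented lines for an element, its attributes, text and children."""
    pad = "    " * depth
    yield "%s%s (Element)" % (pad, element.tag)
    for name, value in sorted(element.attrib.items()):
        yield "%s    %s (Attribute) = %s" % (pad, name, value)
    text = (element.text or "").strip()
    if text:
        yield "%s    Text = %s" % (pad, text)
    for child in element:
        # Recurse one level deeper for each direct child.
        for line in describe(child, depth + 1):
            yield line
        # Text following the child's end tag belongs to the child's tail.
        tail = (child.tail or "").strip()
        if tail:
            yield "%s    Text = %s" % (pad, tail)

# Shortened version of the question's example document (hypothetical subset).
SAMPLE = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
    </country>
</data>"""

lines = list(describe(ET.fromstring(SAMPLE)))
print("\n".join(lines))
```

Because the recursion passes `depth` down, the parent/child relationship is carried by the indentation; to build a table instead, the same generator could yield tuples such as (depth, kind, name, value).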