python xml sax elementtree

Using Python's xml.etree to find element start and end character offsets


I have XML data that looks like:

<xml>
The capital of <place pid="1">South Africa</place> is <place>Pretoria</place>.
</xml>

I would like to be able to extract:

  1. The XML elements as they're currently provided in etree.
  2. The full plain text of the document, between the start and end tags.
  3. The location within the plain text of each start element, as a character offset.

(3) is the most important requirement right now; etree provides (1) fine.

I cannot see any way to do (3) directly, but hoped that iterating through the elements in the document tree would return many small strings that could be reassembled, thus providing (2) and (3). However, requesting the .text of the root node only returns the text between the root's start tag and the first child element, e.g. "The capital of ".

Doing (1) with SAX would mean reimplementing a lot that has already been written many times over, e.g. in minidom and etree. Using lxml isn't an option for the package that this code is to go into. Can anybody help?


Solution

  • The iterparse() function is available in xml.etree.ElementTree:

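    import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9
    
    # `file` is a placeholder: a filename or a file object opened in binary mode
    for event, elem in etree.iterparse(file, events=('start', 'end')):
        if event == 'start':
            print(elem.tag)  # rely only on the tag name and attributes here
        elif event == 'end':
            # child elements and elem.text are complete here; elem.tail is only
            # set once the parser has read the text that follows the end tag
            if elem.text is not None:
                print(repr(elem.text))
            if elem.tail is not None:
                print(repr(elem.tail))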
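
    To get from a parsed tree to requirements (2) and (3), one possible approach is to walk the elements in document order and concatenate each element's .text with the .tail of each of its children, keeping a running character count. The following is a minimal sketch under that assumption; the names text_and_offsets and visit are made up for illustration, and the offsets refer to positions in the decoded plain text, not in the XML source.

```python
import xml.etree.ElementTree as ET


def text_and_offsets(root):
    """Return (plain_text, spans); spans is a list of (element, start, end)."""
    parts = []   # text fragments in document order
    spans = []   # (element, start_offset, end_offset), appended as elements finish
    pos = 0      # number of characters collected so far

    def visit(elem):
        nonlocal pos
        start = pos
        if elem.text:
            parts.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            visit(child)
            # A child's tail is text of the parent that follows the child's end tag.
            if child.tail:
                parts.append(child.tail)
                pos += len(child.tail)
        spans.append((elem, start, pos))

    visit(root)
    return "".join(parts), spans


root = ET.fromstring(
    '<xml>The capital of <place pid="1">South Africa</place>'
    ' is <place>Pretoria</place>.</xml>'
)
text, spans = text_and_offsets(root)
for elem, start, end in spans:
    print(elem.tag, start, end, repr(text[start:end]))
```

    The spans come out in the order the elements close, so the root is last. The walk is recursive, which is fine for shallow documents but could hit Python's recursion limit on very deeply nested input.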
    

    Another option is to override the start(), data(), and end() methods of etree.TreeBuilder:

    from xml.etree.ElementTree import XMLParser, TreeBuilder
    
    class MyTreeBuilder(TreeBuilder):
    
        def start(self, tag, attrs):
            print("<%s>" % tag)
            return TreeBuilder.start(self, tag, attrs)
    
        def data(self, data):
            print(repr(data))
            TreeBuilder.data(self, data)
    
        def end(self, tag):
            return TreeBuilder.end(self, tag)
    
    text = """<xml>
    The captial of <place pid="1">South Africa</place> is <place>Pretoria</place>.
    </xml>"""
    
    # equivalent to ElementTree.fromstring(text), but with a custom tree builder
    parser = XMLParser(target=MyTreeBuilder())
    parser.feed(text)
    root = parser.close() # return an ordinary Element
    

    Output

    <xml>
    '\nThe captial of '
    <place>
    'South Africa'
    ' is '
    <place>
    'Pretoria'
    '.\n'
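
    The same hooks can record offsets instead of printing. The sketch below is one way this might be done, not a definitive implementation: the class name OffsetTreeBuilder and its attributes chunks, length, starts, and spans are hypothetical. It counts characters as data() is called, so the result does not depend on how the parser splits the text into chunks, and the offsets are positions in the decoded plain text.

```python
from xml.etree.ElementTree import XMLParser, TreeBuilder


class OffsetTreeBuilder(TreeBuilder):
    """TreeBuilder that also records plain-text offsets for each element."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # character data in document order
        self.length = 0    # number of characters seen so far
        self.starts = []   # start offsets of the elements that are still open
        self.spans = []    # (element, start, end), appended as elements close

    def start(self, tag, attrs):
        self.starts.append(self.length)
        return super().start(tag, attrs)

    def data(self, data):
        self.chunks.append(data)
        self.length += len(data)
        return super().data(data)

    def end(self, tag):
        elem = super().end(tag)  # TreeBuilder.end() returns the closed element
        self.spans.append((elem, self.starts.pop(), self.length))
        return elem


builder = OffsetTreeBuilder()
parser = XMLParser(target=builder)
parser.feed(
    '<xml>The capital of <place pid="1">South Africa</place>'
    ' is <place>Pretoria</place>.</xml>'
)
root = parser.close()  # an ordinary Element, as before

plain_text = "".join(builder.chunks)
for elem, start, end in builder.spans:
    print(elem.tag, start, end, repr(plain_text[start:end]))
```

    This covers all three requirements in a single pass: root is the usual element tree, plain_text is the full text, and builder.spans ties each element to its start and end offsets.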