How to obtain the root of a tree without parsing the entire file?

I'm writing a parser for XML reports produced by different tools. Each tool generates a report with its own set of tags.

For example:

Arachni generates an XML report with <arachni_report></arachni_report> as the root tag.

nmap generates an XML report with <nmaprun></nmaprun> as the root tag.

I don't want to parse the entire file unless it is a valid report from one of the tools I support.

My first idea was to use ElementTree: parse the entire XML file (assuming it contains valid XML), and then check the root tag to decide whether the report belongs to Arachni or nmap.

I'm currently using cElementTree, and as far as I know getroot() is not an option here, since it is only available once the whole file has been parsed. My goal is for this parser to process recognized files only, without parsing files it doesn't need.

By the way, I'm still learning about XML parsing. Thanks in advance.

Solution

"simple string methods" are the root [pun intended] of all evil -- see examples below.

Update 2: The code and output now show that the proposed regexes also don't work very well.

Use ElementTree. The function that you are looking for is iterparse. Enable "start" events. Bail out on the first iteration; the first "start" event is the one for the root element.

Code:

# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re

xml_text_1 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root
><foo>bar</foo></root
>
"""

xml_text_2 = """\
<?xml version="1.0" ?> 
<!--  this is a comment --> 
<root
><foo>bar</foo></root
>
<!--
That's all, folks! 
-->
"""

xml_text_3 = '''<?xml version="1.0" ?>
<!-- <mole1> -->
<root><foo /></root>
<!-- </mole2> -->'''

xml_text_4 = '''<?xml version="1.0" ?><!-- <mole1> --><root><foo /></root><!-- </mole2> -->'''

for xml_text in (xml_text_1, xml_text_2, xml_text_3, xml_text_4):
    print
    chrstr = xml_text.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x:]
    print "*** eyquem 1:", repr(lastline.strip())

    chrstr = xml_text.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x+1:]
    if lastline[0:5]=='<!-- ':
        chrstr = xml_text[0:x].rstrip()
        x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
        print "*** eyquem 2:", repr(chrstr[x+1:])
    else:
        print "*** eyquem 2:", repr(lastline)

    m = None
    for m in re.finditer('^</[^>]+>', xml_text, re.MULTILINE):
        pass
    if m: print "*** eyquem 3:", repr(m.group())
    else: print "*** eyquem 3:", "FAIL"

    m = None
    for m in re.finditer('</[^>]+>', xml_text):
        pass
    if m: print "*** eyquem 4:", repr(m.group())
    else: print "*** eyquem 4:", "FAIL"

    m = re.search('^<(?![?!])[^>]+>', xml_text, re.MULTILINE)
    if m: print "*** eyquem 5:", repr(m.group())
    else: print "*** eyquem 5:", "FAIL"

    m = re.search('<(?![?!])[^>]+>', xml_text)
    if m: print "*** eyquem 6:", repr(m.group())
    else: print "*** eyquem 6:", "FAIL"

    filelike_obj = StringIO(xml_text)
    tree = et.parse(filelike_obj)
    print "*** parse:", tree.getroot().tag

    filelike_obj = StringIO(xml_text)
    for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
        print "*** iterparse:", elem.tag
        break

The ElementTree-related code above works with Python 2.5 to 2.7. It will also work with Python 2.2 to 2.4; you just need to get ElementTree and cElementTree from effbot.org and do some conditional importing. It should work with any lxml version.

Output:

*** eyquem 1: '>'
*** eyquem 2: '>'
*** eyquem 3: FAIL
*** eyquem 4: '</root\n>'
*** eyquem 5: '<root\n>'
*** eyquem 6: '<root\n>'
*** parse: root
*** iterparse: root

*** eyquem 1: '-->'
*** eyquem 2: '-->'
*** eyquem 3: FAIL
*** eyquem 4: '</root\n>'
*** eyquem 5: '<root\n>'
*** eyquem 6: '<root\n>'
*** parse: root
*** iterparse: root

*** eyquem 1: '<!-- </mole2> -->'
*** eyquem 2: '<root><foo /></root>'
*** eyquem 3: FAIL
*** eyquem 4: '</mole2>'
*** eyquem 5: '<root>'
*** eyquem 6: '<mole1>'
*** parse: root
*** iterparse: root

*** eyquem 1: '>'
*** eyquem 2: '<?xml version="1.0" ?><!-- <mole1> --><root><foo /></root><!-- </mole2> -->'
*** eyquem 3: FAIL
*** eyquem 4: '</mole2>'
*** eyquem 5: FAIL
*** eyquem 6: '<mole1>'
*** parse: root
*** iterparse: root
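
To check that bailing out on the first "start" event really avoids reading the whole input, here is a rough Python 3 sketch (not part of the original demonstration). CountingReader is a hypothetical helper introduced only for this illustration. It wraps an in-memory document and counts how many bytes the parser asks for. How much iterparse reads per chunk is an implementation detail, so the sketch only compares the count with the total size.

```python
import io
import xml.etree.ElementTree as ET


class CountingReader:
    """Hypothetical file-like wrapper that counts the bytes handed to the parser."""

    def __init__(self, data):
        self._buffer = io.BytesIO(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._buffer.read(size)
        self.bytes_read += len(chunk)
        return chunk


# A large document, with a misleading comment in front of the real root.
body = b"<item>x</item>" * 200000
document = b'<?xml version="1.0"?><!-- <mole1> --><root>' + body + b"</root>"

source = CountingReader(document)
root_tag = None
for _event, elem in ET.iterparse(source, events=("start",)):
    root_tag = elem.tag
    break  # stop at the first start event, i.e. the root element

print(root_tag, source.bytes_read, len(document))
```

The root tag is found after only a small part of the document has been read, and the comment containing <mole1> does not confuse the parser.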

Update 1: The above was demonstration code. Below is something closer to implementation code; just add exception handling. Tested with Python 2.7 and 2.2.

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import cElementTree as ET

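def get_root_tag_from_xml_file(xml_file_path):
    """Return the tag of the root element without parsing the whole file."""
    result = f = None
    try:
        f = open(xml_file_path, 'rb')
        # Only 'start' events are requested; the first one is for the root.
        for event, elem in ET.iterparse(f, ('start', )):
            result = elem.tag
            break  # stop before the rest of the document is parsed
    finally:
        # try/finally rather than 'with' keeps this usable on old Pythons.
        if f: f.close()
    return result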

if __name__ == "__main__":
    import sys, glob
    for pattern in sys.argv[1:]:
        for filename in glob.glob(pattern):
            print filename, get_root_tag_from_xml_file(filename)
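
For readers on Python 3, the same idea might look like the sketch below, with the exception handling mentioned above filled in and a lookup for the two root tags named in the question. The names KNOWN_REPORTS, root_tag_of, identify_report and identify_report_file are hypothetical and chosen only for this illustration. In Python 3 there is no separate cElementTree to import; xml.etree.ElementTree uses the C accelerator when it is available.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical lookup table: root tag -> tool name (tags taken from the question).
KNOWN_REPORTS = {
    "arachni_report": "Arachni",
    "nmaprun": "nmap",
}


def root_tag_of(fileobj):
    """Return the root tag of a binary file object, or None if it is not XML."""
    try:
        for _event, elem in ET.iterparse(fileobj, events=("start",)):
            return elem.tag  # first start event is the root; stop here
    except ET.ParseError:
        # Raised for empty input or input that is not well-formed XML.
        return None
    return None


def identify_report(fileobj):
    """Return the tool name for a recognized report, else None."""
    return KNOWN_REPORTS.get(root_tag_of(fileobj))


def identify_report_file(path):
    """Same as identify_report, but opens and closes the file itself."""
    with open(path, "rb") as f:
        return identify_report(f)


sample = io.BytesIO(b"<nmaprun><host/></nmaprun>")
print(identify_report(sample))  # prints: nmap
```

A caller can then run the full parser only when identify_report_file returns something other than None. If a report uses XML namespaces, the tag comes back in the form {namespace}name, so the keys of the lookup table would have to match that.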