Tags: python, elementtree, celementtree

How to obtain the root of a tree without parsing the entire file?


I'm writing an XML parser to handle reports from different tools; each tool generates a report with its own set of tags.

For example:

Arachni generates an XML report with <arachni_report></arachni_report> as the root tag.

nmap generates an XML report with <nmaprun></nmaprun> as the root tag.

I don't want to parse the entire file unless it is a valid report from one of the tools I support.

My first thought was to use ElementTree: parse the entire XML file (assuming it contains valid XML), then check the tree root to see whether the report belongs to Arachni or nmap.

I'm currently using cElementTree, and as far as I know getroot() is not an option here. My goal is for this parser to operate on recognized files only, without parsing unnecessary ones.

By the way, I'm still learning about XML parsing. Thanks in advance.


Solution

  • "Simple string methods" are the root [pun intended] of all evil; see the examples below.

    Update 2: The code and output now show that the proposed regexes don't work very well either.

    Use ElementTree. The function you are looking for is iterparse. Enable "start" events and bail out on the first iteration: the first "start" event is for the root element.

    Code:

    # coding: ascii
    import xml.etree.cElementTree as et
    # import xml.etree.ElementTree as et
    # import lxml.etree as et
    from cStringIO import StringIO
    import re
    
    xml_text_1 = """\
    <?xml version="1.0" ?> 
    <!--  this is a comment --> 
    <root
    ><foo>bar</foo></root
    >
    """
    
    xml_text_2 = """\
    <?xml version="1.0" ?> 
    <!--  this is a comment --> 
    <root
    ><foo>bar</foo></root
    >
    <!--
    That's all, folks! 
    -->
    """
    
    xml_text_3 = '''<?xml version="1.0" ?>
    <!-- <mole1> -->
    <root><foo /></root>
    <!-- </mole2> -->'''
    
    xml_text_4 = '''<?xml version="1.0" ?><!-- <mole1> --><root><foo /></root><!-- </mole2> -->'''
    
    for xml_text in (xml_text_1, xml_text_2, xml_text_3, xml_text_4):
        print
        chrstr = xml_text.strip()
        x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
        lastline = chrstr[x:]
        print "*** eyquem 1:", repr(lastline.strip())
    
        chrstr = xml_text.strip()
        x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
        lastline = chrstr[x+1:]
        if lastline[0:5]=='<!-- ':
            chrstr = xml_text[0:x].rstrip()
            x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
            print "*** eyquem 2:", repr(chrstr[x+1:])
        else:
            print "*** eyquem 2:", repr(lastline)
    
        m = None
        for m in re.finditer('^</[^>]+>', xml_text, re.MULTILINE):
            pass
        if m: print "*** eyquem 3:", repr(m.group())
        else: print "*** eyquem 3:", "FAIL"
    
        m = None
        for m in re.finditer('</[^>]+>', xml_text):
            pass
        if m: print "*** eyquem 4:", repr(m.group())
        else: print "*** eyquem 4:", "FAIL"
    
        m = re.search('^<(?![?!])[^>]+>', xml_text, re.MULTILINE)
        if m: print "*** eyquem 5:", repr(m.group())
        else: print "*** eyquem 5:", "FAIL"
    
        m = re.search('<(?![?!])[^>]+>', xml_text)
        if m: print "*** eyquem 6:", repr(m.group())
        else: print "*** eyquem 6:", "FAIL"
    
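        # Reference result: parse the whole document and ask for the root.
        filelike_obj = StringIO(xml_text)
        tree = et.parse(filelike_obj)
        print "*** parse:", tree.getroot().tag
    
        # iterparse: the first event is the root's "start"; stop right there.
        filelike_obj = StringIO(xml_text)
        for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
            print "*** iterparse:", elem.tag
            break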
    

    The ElementTree-related code above works with Python 2.5 to 2.7. It will also work with Python 2.2 to 2.4; you just need to get ElementTree and cElementTree from effbot.org and do some conditional importing. It should work with any lxml version.

    Output:

    *** eyquem 1: '>'
    *** eyquem 2: '>'
    *** eyquem 3: FAIL
    *** eyquem 4: '</root\n>'
    *** eyquem 5: '<root\n>'
    *** eyquem 6: '<root\n>'
    *** parse: root
    *** iterparse: root
    
    *** eyquem 1: '-->'
    *** eyquem 2: '-->'
    *** eyquem 3: FAIL
    *** eyquem 4: '</root\n>'
    *** eyquem 5: '<root\n>'
    *** eyquem 6: '<root\n>'
    *** parse: root
    *** iterparse: root
    
    *** eyquem 1: '<!-- </mole2> -->'
    *** eyquem 2: '<root><foo /></root>'
    *** eyquem 3: FAIL
    *** eyquem 4: '</mole2>'
    *** eyquem 5: '<root>'
    *** eyquem 6: '<mole1>'
    *** parse: root
    *** iterparse: root
    
    *** eyquem 1: '>'
    *** eyquem 2: '<?xml version="1.0" ?><!-- <mole1> --><root><foo /></root><!-- </mole2> -->'
    *** eyquem 3: FAIL
    *** eyquem 4: '</mole2>'
    *** eyquem 5: FAIL
    *** eyquem 6: '<mole1>'
    *** parse: root
    *** iterparse: root
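
For readers on Python 3, where cStringIO and the print statement no longer exist and xml.etree.cElementTree was removed in 3.9, here is a minimal sketch of the same comparison for the fourth test document only. It assumes plain xml.etree.ElementTree and io.BytesIO as stand-ins; the helper names first_tag_by_regex and root_tag_by_iterparse are invented for this illustration.

```python
import io
import re
import xml.etree.ElementTree as ET

# Same content as xml_text_4 above: the comments contain tag-like decoys.
xml_bytes = (b'<?xml version="1.0" ?><!-- <mole1> -->'
             b'<root><foo /></root><!-- </mole2> -->')


def first_tag_by_regex(data):
    """Regex 6 from the demonstration: first '<...>' not starting with '<?' or '<!'."""
    m = re.search(rb'<(?![?!])[^>]+>', data)
    return m.group().decode('ascii') if m else None


def root_tag_by_iterparse(data):
    """Ask the parser for the first 'start' event, then stop iterating."""
    for _event, elem in ET.iterparse(io.BytesIO(data), events=('start',)):
        return elem.tag
    return None


print(first_tag_by_regex(xml_bytes))     # '<mole1>', taken from inside a comment
print(root_tag_by_iterparse(xml_bytes))  # 'root'
```

The regex has no notion of comments, so it reports the decoy; the parser skips the comment and reports the real root element.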
    

    Update 1: The above was demonstration code. The code below is closer to implementation code; just add exception handling. Tested with Python 2.7 and 2.2.

    try:
        import xml.etree.cElementTree as ET
    except ImportError:
        import cElementTree as ET
    
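    def get_root_tag_from_xml_file(xml_file_path):
        # Return the root element's tag, reading only as much of the file
        # as iterparse needs to report the first "start" event.
        result = f = None
        try:
            f = open(xml_file_path, 'rb')
            for event, elem in ET.iterparse(f, ('start', )):
                result = elem.tag
                break  # the first "start" event is the root; stop here
        finally:
            if f: f.close()
        return result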
    
    if __name__ == "__main__":
        import sys, glob
        for pattern in sys.argv[1:]:
            for filename in glob.glob(pattern):
                print filename, get_root_tag_from_xml_file(filename)
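
The "just add exception handling" step might look like the following on Python 3. This is a sketch under assumptions, not a definitive implementation: the KNOWN_ROOT_TAGS mapping and the identify_tool helper are hypothetical names, the two root tags are the ones quoted in the question, and treating an unreadable or malformed file as "not recognized" (returning None) is one possible policy among several.

```python
import xml.etree.ElementTree as ET

# Root tags quoted in the question; the mapping itself is illustrative.
KNOWN_ROOT_TAGS = {
    "arachni_report": "Arachni",
    "nmaprun": "nmap",
}


def get_root_tag(xml_file_path):
    """Return the root element's tag, or None if the file cannot be
    opened or does not start like well-formed XML."""
    try:
        with open(xml_file_path, "rb") as f:
            for _event, elem in ET.iterparse(f, events=("start",)):
                return elem.tag  # first "start" event is the root element
    except (OSError, ET.ParseError):
        return None
    return None


def identify_tool(xml_file_path):
    """Map the root tag to a tool name, or None for unrecognized files."""
    return KNOWN_ROOT_TAGS.get(get_root_tag(xml_file_path))
```

A caller would run the full parser only when identify_tool(path) returns a tool name. One caveat: if a report declares a default namespace, ElementTree returns the tag as {uri}name, so the dictionary keys would need to include the namespace in that case.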