I'm writing an XML parser to handle reports from different tools, and each tool generates reports with different tags.
For example:
Arachni generates an XML report with <arachni_report></arachni_report> as the tree's root tag.
nmap generates an XML report with <nmaprun></nmaprun> as the tree's root tag.
I don't want to parse the entire file unless it's a valid report from one of the tools I support.
My first thought was to use ElementTree: parse the entire XML file (assuming it contains valid XML), then check the tree root to see whether the report belongs to Arachni or nmap.
I'm currently using cElementTree, and as far as I know getroot() is not an option here. My goal is for this parser to operate on recognized files only, without parsing unnecessary ones.
By the way, I'm still learning about XML parsing. Thanks in advance.
"simple string methods" are the root [pun intended] of all evil -- see examples below.
Update 2: The code and output now show that the proposed regexes also don't work very well.
Use ElementTree. The function you are looking for is iterparse. Enable "start" events. Bail out on the first iteration.
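To make the "bail out on the first iteration" point concrete, here is a minimal sketch, assuming Python 3 and the standard-library xml.etree.ElementTree (not the Python 2 cElementTree used in the rest of this answer). The CountingBytesIO class, the first_start_tag helper, and the sample document are made up for illustration. It counts how many bytes the parser actually pulls from the stream before the root tag is known.

```python
import io
import xml.etree.ElementTree as ET


class CountingBytesIO(io.BytesIO):
    """Hypothetical helper: a BytesIO that records how many bytes read() returned."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def first_start_tag(source):
    """Return the tag of the first element opened in `source`, i.e. the root."""
    for _event, elem in ET.iterparse(source, events=("start",)):
        return elem.tag  # bail out on the first "start" event
    return None


# A made-up report: a comment containing a decoy tag, then a root element
# with many children, so the whole document is far larger than one read.
body = "".join("<host id='%d'>x</host>" % i for i in range(50000))
document = (
    "<?xml version='1.0'?><!-- <mole1> --><nmaprun>" + body + "</nmaprun>"
).encode("ascii")

stream = CountingBytesIO(document)
root_tag = first_start_tag(stream)
print(root_tag)
print(stream.bytes_read, "of", len(document), "bytes read")
```

The parser reads the input in chunks, so the exact byte count depends on the implementation, but it should be a small fraction of the file when the root tag appears near the top.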
Demonstration code:
# coding: ascii
import xml.etree.cElementTree as et
# import xml.etree.ElementTree as et
# import lxml.etree as et
from cStringIO import StringIO
import re
xml_text_1 = """\
<?xml version="1.0" ?>
<!-- this is a comment -->
<root
><foo>bar</foo></root
>
"""
xml_text_2 = """\
<?xml version="1.0" ?>
<!-- this is a comment -->
<root
><foo>bar</foo></root
>
<!--
That's all, folks!
-->
"""
xml_text_3 = '''<?xml version="1.0" ?>
<!-- <mole1> -->
<root><foo /></root>
<!-- </mole2> -->'''
xml_text_4 = '''<?xml version="1.0" ?><!-- <mole1> --><root><foo /></root><!-- </mole2> -->'''
for xml_text in (xml_text_1, xml_text_2, xml_text_3, xml_text_4):
    print

    chrstr = xml_text.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x:]
    print "*** eyquem 1:", repr(lastline.strip())

    chrstr = xml_text.strip()
    x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
    lastline = chrstr[x+1:]
    if lastline[0:5]=='<!-- ':
        chrstr = xml_text[0:x].rstrip()
        x = max(chrstr.rfind('\r'),chrstr.rfind('\n'))
        print "*** eyquem 2:", repr(chrstr[x+1:])
    else:
        print "*** eyquem 2:", repr(lastline)

    m = None
    for m in re.finditer('^</[^>]+>', xml_text, re.MULTILINE):
        pass
    if m: print "*** eyquem 3:", repr(m.group())
    else: print "*** eyquem 3:", "FAIL"

    m = None
    for m in re.finditer('</[^>]+>', xml_text):
        pass
    if m: print "*** eyquem 4:", repr(m.group())
    else: print "*** eyquem 4:", "FAIL"

    m = re.search('^<(?![?!])[^>]+>', xml_text, re.MULTILINE)
    if m: print "*** eyquem 5:", repr(m.group())
    else: print "*** eyquem 5:", "FAIL"

    m = re.search('<(?![?!])[^>]+>', xml_text)
    if m: print "*** eyquem 6:", repr(m.group())
    else: print "*** eyquem 6:", "FAIL"

    filelike_obj = StringIO(xml_text)
    tree = et.parse(filelike_obj)
    print "*** parse:", tree.getroot().tag

    filelike_obj = StringIO(xml_text)
    for event, elem in et.iterparse(filelike_obj, ('start', 'end')):
        print "*** iterparse:", elem.tag
        break
The ElementTree-related code above works with Python 2.5 to 2.7. It will also work with Python 2.2 to 2.4; you just need to get ElementTree and cElementTree from effbot.org and do some conditional importing. It should work with any lxml version.
Output:
*** eyquem 1: '>'
*** eyquem 2: '>'
*** eyquem 3: FAIL
*** eyquem 4: '</root\n>'
*** eyquem 5: '<root\n>'
*** eyquem 6: '<root\n>'
*** parse: root
*** iterparse: root

*** eyquem 1: '-->'
*** eyquem 2: '-->'
*** eyquem 3: FAIL
*** eyquem 4: '</root\n>'
*** eyquem 5: '<root\n>'
*** eyquem 6: '<root\n>'
*** parse: root
*** iterparse: root

*** eyquem 1: '<!-- </mole2> -->'
*** eyquem 2: '<root><foo /></root>'
*** eyquem 3: FAIL
*** eyquem 4: '</mole2>'
*** eyquem 5: '<root>'
*** eyquem 6: '<mole1>'
*** parse: root
*** iterparse: root

*** eyquem 1: '>'
*** eyquem 2: '<?xml version="1.0" ?><!-- <mole1> --><root><foo /></root><!-- </mole2> -->'
*** eyquem 3: FAIL
*** eyquem 4: '</mole2>'
*** eyquem 5: FAIL
*** eyquem 6: '<mole1>'
*** parse: root
*** iterparse: root
Update 1: The above was demonstration code. Below is something closer to implementation code; just add exception handling. Tested with Python 2.7 and 2.2.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import cElementTree as ET

def get_root_tag_from_xml_file(xml_file_path):
    result = f = None
    try:
        f = open(xml_file_path, 'rb')
        for event, elem in ET.iterparse(f, ('start', )):
            result = elem.tag
            break
    finally:
        if f: f.close()
    return result

if __name__ == "__main__":
    import sys, glob
    for pattern in sys.argv[1:]:
        for filename in glob.glob(pattern):
            print filename, get_root_tag_from_xml_file(filename)
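Tying this back to the question, here is a rough sketch of the same idea with exception handling added and a lookup from root tag to tool. It assumes Python 3 and xml.etree.ElementTree rather than the Python 2 code above. The root tags come from the question; KNOWN_ROOT_TAGS, identify_report, and the sample files are hypothetical names and data used only for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Root tags are the ones named in the question; the labels are illustrative.
KNOWN_ROOT_TAGS = {
    "arachni_report": "Arachni",
    "nmaprun": "nmap",
}


def identify_report(xml_file_path):
    """Return the tool name for a recognized report, or None otherwise."""
    try:
        with open(xml_file_path, "rb") as f:
            for _event, elem in ET.iterparse(f, events=("start",)):
                # The first "start" event belongs to the root element,
                # so decide here and stop without reading the rest.
                return KNOWN_ROOT_TAGS.get(elem.tag)
    except (ET.ParseError, OSError):
        # Not well-formed up to the root tag, or the file is unreadable.
        return None
    return None  # no element was seen


# Made-up sample files to exercise the function.
samples = {
    "arachni.xml": b"<?xml version='1.0'?><arachni_report><issues/></arachni_report>",
    "nmap.xml": b"<?xml version='1.0'?><!-- <mole1> --><nmaprun><host/></nmaprun>",
    "other.xml": b"<root><foo/></root>",
    "broken.xml": b"this is not XML",
}
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, content in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as out:
            out.write(content)
        results[name] = identify_report(path)
print(results)
```

Note that this only confirms the root tag, not that the rest of the file is well-formed. If a report declares a default namespace, elem.tag comes back in the form '{uri}name', so the lookup table would need those forms instead.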