I am currently using the xml.sax parser to parse XML files.
Suppose I have the following code:
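import xml.sax

filepath = 'users/file.xml'
try:
    parser = xml.sax.make_parser()
    # Open in binary mode so the parser honours the encoding declared in the file
    with open(filepath, 'rb') as f:
        parser.parse(f)
except xml.sax.SAXParseException as e:
    print("*** PARSER error: %s" % e)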
file.xml
<?xml version="1.0" encoding="utf-8"?>
<tag1>
<tag2>
<P style="MARGIN: 0in 0in 0pt" class="MsoNormal"><FONT size="3"><SPAN style="FONT-FAMILY: Symbol; COLOR: black; mso-ascii-font-family: 'Times New Roman'">�</SPAN><SPAN style="COLOR: black"><FONT face="Times New Roman"><SPAN style="mso-spacerun: yes"> </SPAN>Position will manage 24 ED Rooms with 24/7 accountability<o:p></o:p></FONT></SPAN></FONT></P>
<DIV>&nbsp;</DIV>
</tag2>
</tag1>
When the parser reaches the & in the DIV tag, it stops and displays the following error:
*** PARSER error: users/file.xml:5:1: not well-formed (invalid token)
How can I remove or escape all the invalid XML tokens in the file before handing it to the parser? Is there a function that escapes & and other special characters in the XML, or do I need to loop through the file and strip every invalid token myself? I don't know how to do that. Can anyone please share code for it?
Don't try to repair the bad XML. Fix the process that created bad XML in the first place. You haven't told us what program wrote this stuff. The whole point about XML is that it's a standard, and you only get benefits from it if people actually stick to the standard.
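If the program that writes the file is under your control, the fix on that side could look roughly like the sketch below. It assumes the HTML fragment is meant to travel as plain text inside tag2, which the question does not actually say. The names wrap_html_fragment and TextCollector are made up for illustration. The only library call doing the real work is xml.sax.saxutils.escape, which replaces &, < and > with their entity references.

```python
import xml.sax
from xml.sax.saxutils import escape


def wrap_html_fragment(html_fragment):
    """Hypothetical producer: embed an HTML fragment as escaped text in XML."""
    # escape() turns &, < and > into entity references, so the fragment
    # cannot break the well-formedness of the surrounding document.
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        "<tag1><tag2>%s</tag2></tag1>" % escape(html_fragment)
    )


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical consumer: gather character data from the parsed document."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect them all.
        self.chunks.append(content)


fragment = "<DIV>&nbsp;</DIV>"
document = wrap_html_fragment(fragment)

handler = TextCollector()
xml.sax.parseString(document.encode("utf-8"), handler)
recovered = "".join(handler.chunks)
```

Here the parser accepts the document without complaint, and recovered holds the original fragment again because the parser undoes the escaping when it reports character data. If the HTML is supposed to be real markup inside the XML rather than text, the producer would instead have to emit well-formed XHTML and use only entities that XML defines.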