I have an XML document that contains attributes like the following:
<Tag Body="&lt;p&gt;">
I want to preserve the text in the Body attribute exactly as-is; however, the parsing method is converting the text to "<p>". I want to keep the "&", "l", "t", ";", etc.
I'm using the Java SAX API to parse the XML document like so:
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser saxParser = spf.newSAXParser();
XMLReader xmlReader = saxParser.getXMLReader();
xmlReader.setContentHandler(new MyHandler());
xmlReader.setErrorHandler(new MyErrorHandler(System.err));
xmlReader.parse(convertToFileURL(myFileName));
The relevant code in MyHandler.java is:
public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
        throws SAXException
{
    if (qName.equals("Tag")) {
        String Body = atts.getValue("Body");
        char[] s = Body.toCharArray(); // s[0] will be '<', but I want it to be '&'
    }
}
How can I get the parsing method to leave the attribute text alone and not try to convert anything?
I'll answer my own question.
I didn't find a way to stop the parser from unescaping the text in the first place, but I did find a workaround (thanks @user1516873): re-escape the value afterwards using StringEscapeUtils from Apache Commons Lang:
String Body = atts.getValue("Body");
String Body_escaped = StringEscapeUtils.escapeXml(Body);
This achieves the desired results.
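
For anyone who wants to try this without adding a dependency, here is a minimal, self-contained sketch of the same idea. The class name ReEscapeDemo and the helpers escapeXml and readEscapedBody are made up for illustration; this escapeXml is a hand-written stand-in that only handles the five predefined XML entities, not the Apache Commons implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ReEscapeDemo {

    // Hypothetical stand-in for StringEscapeUtils.escapeXml:
    // replaces the five predefined XML entities and nothing else.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Parses the XML string and returns the re-escaped Body attribute
    // of the first Tag element, or null if there is none.
    static String readEscapedBody(String xml) {
        final String[] result = new String[1];
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (qName.equals("Tag") && result[0] == null) {
                        String body = atts.getValue("Body"); // already unescaped by the parser
                        if (body != null) {
                            result[0] = escapeXml(body);     // re-escape it afterwards
                        }
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        // prints &lt;p&gt;
        System.out.println(readEscapedBody("<Tag Body=\"&lt;p&gt;\"/>"));
    }
}
```

One caveat: re-escaping restores a canonical form, not necessarily the exact source text. If the file had written the same character as a numeric reference such as &#60;, the result here would still be &lt;.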