java, xml, ampersand

How can I make my XML safe for parsing (when it has an & character in it)?


I've been given an XML string which I need to put through a parser. It's currently complaining because of an illegal XML character. Very simplified example:

<someXml>this & that</someXml>

I know that the solution is to replace & with &amp;, but I'm not generating the XML and therefore have no control over the values.

A simple string replace is not the right way to do this, since '&' has special meaning in XML and a global replace of '&' with '&amp;' would ruin that meaning wherever it was intended. Is there a way to take a full XML document and 'fix' it so that '&' becomes '&amp;', but only where intended? Am I safe to globally replace ' & ' with ' &amp; ' (note the spaces on either side)?


Solution

  • I think this is an interesting question, because it's a situation that can really happen in practice. Although I believe the right thing to do is to ask the XML provider to fix the XML and make it well-formed, one option is to try a lenient parser. I did some searching and found this blog post, which discusses the same problem and suggests the same solution I was thinking of. You may try jsoup. Let me repeat that I don't think this is the best approach: you should really ask the XML provider to fix it.
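
If the provider cannot fix the feed and adding a lenient parser is not an option, the "only where intended" replacement asked about in the question can be roughly sketched with a negative lookahead. It escapes an '&' only when it does not already start something shaped like an entity or character reference. This is a minimal sketch using only JDK classes, not a general repair tool; the class and method names (`AmpersandFixer`, `escapeBareAmpersands`) are made up for illustration.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class AmpersandFixer {

    // Matches '&' that is NOT followed by a named reference (amp;),
    // a decimal reference (#38;) or a hex reference (#x26;).
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // Escapes only the ampersands that do not already start a reference.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    // Parses with the strict JDK parser; throws if the text is still not well-formed.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String broken = "<someXml>this & that, A&B, already &amp; escaped</someXml>";
        String fixed = escapeBareAmpersands(broken);
        System.out.println(fixed);
        // The strict parser now accepts the document and decodes the references.
        System.out.println(parse(fixed).getDocumentElement().getTextContent());
    }
}
```

Unlike replacing only ' & ' with spaces on both sides, this also catches cases such as `A&B`. It still has known gaps. An '&' inside a CDATA section is legal as-is and would be escaped by mistake. A name that merely looks like a reference but is not declared (for example `&foo;`) is left alone and will still be rejected by the parser.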