How to write regex for XML which removes unescaped ampersand characters except CDATA?


For example, I have XML like this:

<title>Very bad XML with & (unescaped)</title>
<title>Good XML with &amp; and &#x3E; (escaped)</title>
<title><![CDATA[ Good XML with & in CDATA ]]></title>

My task is to remove invalid ampersand characters from the XML, except for those inside CDATA sections. I found a regex that does this:

&(?!(?:apos|quot|[gl]t|amp);|#)

but unfortunately, it also removes ampersand characters from CDATA. How can I change this regex so that it satisfies my task?
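
To make the problem reproducible, here is a minimal Python sketch (Python is only an assumption; the question does not name a language) that applies the regex above to the three sample lines. The names `ISOLATED_AMP` and `samples` are made up for illustration.

```python
import re

# Regex from the question: an '&' that does not start one of the five
# predefined entities or a character reference.
ISOLATED_AMP = re.compile(r"&(?!(?:apos|quot|[gl]t|amp);|#)")

samples = [
    "<title>Very bad XML with & (unescaped)</title>",
    "<title>Good XML with &amp; and &#x3E; (escaped)</title>",
    "<title><![CDATA[ Good XML with & in CDATA ]]></title>",
]

# Remove every match; the regex has no notion of CDATA boundaries.
cleaned = [ISOLATED_AMP.sub("", s) for s in samples]

for line in cleaned:
    print(line)
```

The first line loses its bare & as intended and the second is left alone, but the third line loses the & inside the CDATA section as well, which is the behavior I want to avoid.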


Solution

  • As you're aware, the "XML" isn't actually XML, because of the unescaped & outside of CDATA. You're therefore stuck pre-processing it without the help of an XML parser to tell CDATA from PCDATA. That's rough, and regex alone isn't up to the task, for the same reasons that regex isn't up to parsing XML.

    Here's one approach that can help:

    1. Use regex to replace all isolated (not part of a character entity) & characters with &amp;TEMP, including those within CDATA.
    2. Using an XML parser on the now well-formed XML, restore the &amp;TEMP occurrences within CDATA to &.
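
The two steps above could be sketched roughly as follows. This is only an illustration under several assumptions: Python and its standard-library `xml.dom.minidom` (chosen because it keeps CDATA sections as separate nodes) are not part of the original answer, the function names `restore` and `repair` are hypothetical, and the input is assumed to have a single root element. The sketch also cleans the leftover TEMP marker out of ordinary text nodes, which the answer does not spell out, and it assumes the literal text "&TEMP" never occurs in the real data.

```python
import re
from xml.dom import Node, minidom

# Regex from the question: an '&' that does not start a predefined entity
# or a character reference.
ISOLATED_AMP = re.compile(r"&(?!(?:apos|quot|[gl]t|amp);|#)")
PLACEHOLDER = "&amp;TEMP"  # marker named in the answer


def restore(node):
    """Walk the DOM and undo the placeholder (hypothetical helper)."""
    for child in node.childNodes:
        if child.nodeType == Node.CDATA_SECTION_NODE:
            # CDATA is not entity-decoded, so the marker is still literal.
            child.data = child.data.replace(PLACEHOLDER, "&")
        elif child.nodeType == Node.TEXT_NODE:
            # Assumption: in parsed text the marker has been decoded to
            # "&TEMP"; turn it back into a plain '&', which the serializer
            # will write out as "&amp;".
            child.data = child.data.replace("&TEMP", "&")
        else:
            restore(child)


def repair(raw):
    # Step 1: make the input well-formed, CDATA included.
    escaped = ISOLATED_AMP.sub(PLACEHOLDER, raw)
    # Step 2: parse, then restore the ampersands per node type.
    doc = minidom.parseString(escaped)
    restore(doc)
    return doc.documentElement.toxml()


sample = (
    "<root>"
    "<title>Very bad XML with & (unescaped)</title>"
    "<title>Good XML with &amp; and &#x3E; (escaped)</title>"
    "<title><![CDATA[ Good XML with & in CDATA ]]></title>"
    "</root>"
)
print(repair(sample))
```

Note that this escapes the stray & in regular text rather than deleting it; to delete it instead, replace "&TEMP" with an empty string in the text-node branch. The `<root>` wrapper in `sample` is added only so the three titles from the question form one document.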

    See also: How to parse invalid (bad / not well-formed) XML?

    • General advice on parsing messy "XML"
    • Tolerant parsers
    • Regexes for matching invalid characters and &'s