For example, I have XML like this:
<title>Very bad XML with & (unescaped)</title>
<title>Good XML with &amp; and &gt; (escaped)</title>
<title><![CDATA[ Good XML with & in CDATA ]]></title>
My task is to remove invalid ampersand characters from the XML, excluding those that are inside CDATA. I found a regex that does this:
&(?!(?:apos|quot|[gl]t|amp);|#)
but unfortunately, it also removes ampersand characters from CDATA. How can I change this regex so that it satisfies my task?
As you're aware, the "XML" isn't XML due to the unescaped & outside of CDATA.
Thus, you're stuck having to pre-process without the benefit of an XML parser to differentiate between CDATA and PCDATA. That's rough, and regex isn't up to the task for all the reasons that regex isn't up to parsing XML.
Here's one approach that can help:
1. Replace & characters with &TEMP, including those within CDATA.
2. Change the &TEMP occurrences within CDATA back to &.

See also: How to parse invalid (bad / not well-formed) XML?
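The two steps above can be sketched in Python roughly as follows. This is only a sketch under some assumptions: the placeholder is written in its escaped form, `&amp;TEMP`, so that the patched text is well-formed and `xml.dom.minidom` (which keeps CDATA sections as separate nodes) can parse it; the input has a single root element; and `fix_ampersands` and its `outside` parameter are hypothetical names, not part of any library.

```python
import re
from xml.dom import minidom

# Regex from the question: an '&' that starts neither a predefined entity
# nor a character reference.
BAD_AMP = re.compile(r"&(?!(?:apos|quot|[gl]t|amp);|#)")

# Assumption: the placeholder is written in escaped form so the text parses.
PLACEHOLDER = "&amp;TEMP"


def fix_ampersands(raw, outside=""):
    """Hypothetical helper: drop (or replace) stray '&' outside CDATA only."""
    # Step 1: swap every offending '&' for the placeholder, CDATA included.
    patched = BAD_AMP.sub(PLACEHOLDER, raw)
    doc = minidom.parseString(patched)

    def walk(node):
        for child in node.childNodes:
            if child.nodeType == child.CDATA_SECTION_NODE:
                # Step 2: CDATA content is literal, so restore the '&'.
                child.data = child.data.replace(PLACEHOLDER, "&")
            elif child.nodeType == child.TEXT_NODE:
                # The parser has decoded '&amp;TEMP' to '&TEMP' here.
                child.data = child.data.replace("&TEMP", outside)
            else:
                walk(child)

    walk(doc)
    return doc.documentElement.toxml()


raw = ("<root><title>Bad & here</title>"
       "<title><![CDATA[ Keep & here ]]></title></root>")
print(fix_ampersands(raw))
```

Passing `outside="&"` keeps the ampersand instead of removing it; the serializer then writes it out as `&amp;`. The sketch does not touch attribute values or comments, and it would also alter any text that legitimately contains `&amp;TEMP`, so a less collision-prone placeholder may be preferable in practice.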