
Is there a way to escape a CDATA end token in xml?


I was wondering if there is any way to escape a CDATA end token (]]>) within a CDATA section in an XML document. Or, more generally, whether there is any escape sequence for use within a CDATA section (though if one exists, I guess it would only make sense for escaping the begin or end tokens anyway).

Basically, can you have a begin or end token embedded in a CDATA section and tell the parser not to interpret it, but to treat it as just another character sequence?

You should probably just refactor your XML structure or your code if you find yourself trying to do that. Still, even though I've worked with XML daily for the last 3 years or so and have never had this problem, I was wondering whether it is possible. Just out of curiosity.

Edit:

Other than using html encoding...


Solution

  • You cannot escape a CDATA end sequence. Production rule 20 of the XML specification is quite clear:

    [20]    CData      ::=      (Char* - (Char* ']]>' Char*))
    

    EDIT: This production rule literally means "A CData section may contain anything you want BUT the sequence ']]>'. No exception.".

    EDIT2: The same section also reads:

    Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest.

    In other words, it's not possible to use entity references, markup or any other form of interpreted syntax. The only text a parser interprets inside a CDATA section is ]]>, and it terminates the section.

    Hence, it is not possible to escape ]]> within a CDATA section.

    EDIT3: The same section also reads:

    2.7 CDATA Sections

    [Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>":]

    So a CDATA section may appear anywhere character data may occur, and several adjacent CDATA sections may take the place of a single one. That makes it possible to split the ]]> token and put its two parts in adjacent CDATA sections.

    ex:

    <![CDATA[Certain tokens like ]]> can be difficult and <invalid>]]> 
    

    should be written as

    <![CDATA[Certain tokens like ]]]]><![CDATA[> can be difficult and <valid>]]>
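
The splitting trick above can be automated when generating XML by hand. Below is a minimal Python sketch, not a definitive implementation: the helper name `to_cdata` and the constants are made up for illustration, and it assumes the text contains only characters that are legal in XML. It replaces every ]]> with ]]]]><![CDATA[> and then checks, using the standard library's ElementTree parser, that the adjacent sections read back as the original text.

```python
import xml.etree.ElementTree as ET

CDATA_START = "<![CDATA["
CDATA_END = "]]>"


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every ']]>' across two adjacent sections."""
    # The "]]" stays at the end of one section and the ">" starts the next,
    # so the literal end token never appears inside any single section.
    split_text = text.replace(CDATA_END, "]]" + CDATA_END + CDATA_START + ">")
    return CDATA_START + split_text + CDATA_END


sample = "Certain tokens like ]]> can be difficult and <valid>"
wrapped = to_cdata(sample)

# The parser reports adjacent CDATA sections as one run of character data,
# so the element text should match the original string.
root = ET.fromstring("<root>" + wrapped + "</root>")
assert root.text == sample
```

Most XML serializers either do this for you or emit escaped text instead of CDATA, so a helper like this is mainly useful when building the markup as strings.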