
Is it correct to escape "&", ">" and "<" with &#38;, &#62; and &#60; in XML?


Will something "break" if I use numeric entities instead of the usual recommended alpha entities for reserved chars in XML?

This is part of a rather complex app that allows users to enter bibliographic metadata via XML, CSV or web-based forms. This data can then be extracted in XML (using the ONIX standard) with user-chosen encodings: utf-8, win-1252, etc.

The original programmers (long gone now...) decided to use numeric entities for all chars that cannot be represented in the chosen encoding. XML-reserved chars are considered as non-representable under any encoding. They are given the same treatment and are encoded using numeric entities.
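For illustration only, the behaviour described above might look roughly like the following Python sketch. The function name `escape_for_encoding` is hypothetical and this is not the application's actual code; it assumes decimal references for the three reserved characters plus Python's built-in `xmlcharrefreplace` error handler for anything the target encoding cannot represent.

```python
def escape_for_encoding(text: str, encoding: str) -> bytes:
    """Hypothetical sketch: reserved and unencodable chars become &#N; references."""
    # Escape "&" first so the ampersands introduced by later
    # replacements are not escaped a second time.
    for ch in "&<>":
        text = text.replace(ch, f"&#{ord(ch)};")
    # Characters missing from the target encoding become decimal
    # character references; everything else is encoded normally.
    return text.encode(encoding, errors="xmlcharrefreplace")


ascii_bytes = escape_for_encoding("a & b", "ascii")
cp1252_bytes = escape_for_encoding("\u0159<", "cp1252")  # U+0159 is not in win-1252
```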

Some users have complained about &, <, >, etc. being encoded as &#38;, etc. instead of the usual alpha codes (&amp;, etc.), and I'd like to know whether these complaints have any substance.

If I can avoid digging through the legacy code to change this behaviour, it would save me a lot of resources.


Solution

  • Yes, it's fine to escape using numeric character references.

    From the spec (emphasis mine):

    The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

    You could also use a hexadecimal character reference. All three forms on each line below are equivalent:

    &amp; = &#38; = &#x26;

    &lt; = &#60; = &#x3C;

    &gt; = &#62; = &#x3E;
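One quick way to check this yourself is to feed all three forms to a conforming parser and compare the results. The sketch below uses Python's standard `xml.etree.ElementTree`; the element name `t` and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The same content written with named, decimal and hexadecimal references.
variants = {
    "named": "<t>&amp; &lt; &gt;</t>",
    "decimal": "<t>&#38; &#60; &#62;</t>",
    "hex": "<t>&#x26; &#x3C; &#x3E;</t>",
}

# Parse each document and collect the decoded text of the root element.
texts = {name: ET.fromstring(doc).text for name, doc in variants.items()}

# Every variant decodes to the same three literal characters.
all_equal = len(set(texts.values())) == 1
```

After parsing, the application sees the same string in every case, so any consumer that uses a real XML parser cannot tell which form was in the file.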