python xml cdata

XML CDATA section: special character encoding error when trying to open in browser


I have several XML files on our Amazon S3 server containing our company's ads, which we want to display on various sites. Some of them require all the info to be wrapped in CDATA tags. But when I try to open a file in my browser, it always gives me encoding errors because of the special characters in the text. The text in each file is in a different language (French, Spanish, etc.).

But isn't the CDATA section meant to ignore all special characters? I'm pretty new to Python, XML, etc., but I couldn't find an answer on Google (maybe I'm phrasing the problem the wrong way, idk).

As soon as I encode the special characters (e.g. &) and remove the CDATA tags, I can view the file in my browser without problems.

<?xml version="1.0" encoding="utf-8"?>
<source xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<job>
<id><![CDATA[removed]]></id>
<url><![CDATA[removed]]></url>
<title><![CDATA[removed]]></title>
<description><![CDATA[removed]]></description>
<date><![CDATA[removed]]></date>
<country><![CDATA[removed]]></country>
<city><![CDATA[removed]]></city>
<company><![CDATA[removed]]></company>
</job>

</source>

I expected to be able to put any special characters into CDATA without any problems but I'm not able to.


Solution

  • Using CDATA means you don't need to escape the XML-special characters such as "<" and "&" as &lt; and &amp;. It has no effect on how non-ASCII characters such as French accented letters are handled. These must be encoded (not escaped) using the character encoding declared in the XML declaration, exactly as if they were outside CDATA. So if the declaration says encoding="utf-8", the bytes of the file must actually be UTF-8. (It's very anglo-centric to regard these characters as being in any way "special".)
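To make the point above concrete, here is a minimal Python sketch using only the standard library. The sample title text is made up for illustration, and Latin-1 is only a guess at how a mis-saved file might have been written; the question doesn't say which encoding the real files use. The sketch shows that CDATA takes care of "<" and "&", while the accented letter still has to match the declared encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the CDATA section holds "&", "<" and an accented letter.
xml_text = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<source><job>"
    "<title><![CDATA[Ingénieur R&D <Paris>]]></title>"
    "</job></source>"
)

# Bytes that match the declaration (UTF-8) parse fine; "&" and "<" need no escaping in CDATA.
good_bytes = xml_text.encode("utf-8")
title = ET.fromstring(good_bytes).find("job/title").text

# Same text saved as Latin-1 (an assumed mistake) while still declaring UTF-8.
bad_bytes = xml_text.encode("latin-1")
try:
    ET.fromstring(bad_bytes)
    mismatch_parses = True
except ET.ParseError:
    # The parser rejects the byte for the accented letter; CDATA does not help here.
    mismatch_parses = False

# One possible repair, assuming the file really was Latin-1: decode, then re-encode as UTF-8.
repaired_bytes = bad_bytes.decode("latin-1").encode("utf-8")
repaired_title = ET.fromstring(repaired_bytes).find("job/title").text
```

If the real files turn out to be in some other encoding, the alternative fix is to leave the bytes alone and change the encoding named in the XML declaration to match them.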