XML files with the same content are parsed differently


I have an xml file.

<?xml version="1.0" encoding="UTF-8"?>
<channel>
    <item>content with special character é</item>
</channel>

Assume that the above is the xml file, except with content from a product catalogue, with a lot more tags and content. This is created using the following process:

  1. call database from coldfusion file
  2. get content from database with procedure and return to coldfusion file
  3. create an xml file in coldfusion (just by using a filename, ex: "filename.xml")
  4. write the contents to the file by looping through the query in coldfusion and adding product per product to the xml file

This gives me errors when I try to open the file in Firefox (my way of testing whether the xml file parses). It tells me that I have some special characters that need escaping ("xml not well-formed" or something like that). So I put CDATA sections inside these xml tags, which should clear this up, right? It doesn't. It keeps stumbling over special characters, not just the ones that are reserved for xml (&, <, >, ..).

Here's when I started losing it. After some trying and testing with smaller xml files created manually (not through coldfusion), I got it to work, just by dropping the CDATA sections and inserting the above code. Firefox parses the above code just fine. So after some thinking, I copied the entire contents of the faulty file, the original one, to a brand new manually created xml file (.txt --> renamed to .xml) and voila, no more errors.

Can somebody please explain to me how, in this case, two separate files with the exact same content, copied from the first to the second, get parsed differently? The first one shows multiple errors on special characters, the second one has no problem with these at all..? Please, someone, before I go berserk at my desk here.. >_>

Edit 1: When I say special characters, I specifically mean utf-8 characters. I'm not talking about the characters reserved for xml (&, <, >, ...), I already escape these.


Solution

  • As far as XML is concerned, there are no special characters in the example you give, just normal ones like c and é (I suppose space is a bit special).

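    I would guess from what you describe that you are using the wrong encoding. You're saying it's UTF-8, but is it really? If the file is actually written in some other encoding (ISO-8859-1, say), the bytes for é are not valid UTF-8 and the parser rejects them. That would also explain why the copy works: the editor presumably saved the pasted text as UTF-8, so the bytes now match the declaration.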
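
One way to check this guess, sketched in Python rather than ColdFusion: encode the same text two ways and hand both byte strings to a parser. The choice of ISO-8859-1 as the "wrong" encoding and the helper name `is_well_formed` are assumptions for illustration; your file may be in some other encoding.

```python
import xml.etree.ElementTree as ET

# The same document text, declaring UTF-8 in its XML declaration.
# \u00e9 is the character é.
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<channel>\n'
    '    <item>content with special character \u00e9</item>\n'
    '</channel>\n'
)

# Two "files" that look identical in an editor but differ on disk.
as_utf8 = doc.encode("utf-8")          # é becomes the two bytes C3 A9
as_latin1 = doc.encode("iso-8859-1")   # é becomes the single byte E9


def is_well_formed(data: bytes) -> bool:
    """Return True if the bytes parse as XML, False otherwise."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


print("UTF-8 bytes parse:", is_well_formed(as_utf8))
print("ISO-8859-1 bytes parse:", is_well_formed(as_latin1))
```

The first byte string parses and the second does not, even though both came from the same text. Looking at the faulty file in a hex viewer for a lone E9 byte (instead of C3 A9) would tell you whether this is what is happening.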

    If this is the problem, you have three options:

    1. Fix the code to write the file in UTF-8.
    2. Fix the code to describe the encoding it's actually in (do so in both the HTTP headers and the XML declaration).
    3. Keep saying it's UTF-8, but escape any character outside of the US-ASCII range (U+0000 to U+007F). E.g. you'd output the above as:

    <?xml version="1.0" encoding="UTF-8"?>
    <channel>
        <item>content with special character &#xe9;</item>
    </channel>
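
A minimal sketch of option 3, again in Python as an illustration: escape the reserved characters first, then replace everything above U+007F with a hexadecimal character reference. The function name `to_ascii_xml_text` is made up for this example.

```python
from xml.sax.saxutils import escape


def to_ascii_xml_text(text: str) -> str:
    """Return text that is safe as XML element content and pure ASCII."""
    # Escape the XML-reserved characters &, < and > first.
    escaped = escape(text)
    # Replace each character above U+007F with a hex character reference.
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):x};"
        for ch in escaped
    )


# \u00e9 is the character é.
print(to_ascii_xml_text("content with special character \u00e9"))
```

The result contains only ASCII bytes, so it reads the same whether the file is written as UTF-8, ISO-8859-1 or Windows-1252. Python's built-in `"xmlcharrefreplace"` error handler for `str.encode` does much the same job with decimal references.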
    

    Number one is the best to go for, but failing that, there are pros and cons to the other two. In favour of number 2, it doesn't escape lots of characters that don't really need escaping. In favour of number 3, UTF-8 and UTF-16 are the only encodings an XML parser has to accept, and faking it this way works with any character set that matches UTF-8 for the range U+0000 to U+007F, which is a lot of them.
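
For options 1 and 2 the point is the same: the encoding used to write the bytes and the encoding named in the declaration must agree. Here is a rough Python sketch, with a hypothetical `write_xml` helper and the `<channel>`/`<item>` layout taken from the question. As far as I know, ColdFusion's `cffile` tag has a `charset` attribute that plays the role of the `encoding` argument here.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def write_xml(path: str, items: list, encoding: str = "utf-8") -> None:
    """Write a <channel> file whose declaration names its real encoding."""
    # The same encoding is used for the bytes and for the declaration.
    with open(path, "w", encoding=encoding, newline="\n") as f:
        f.write(f'<?xml version="1.0" encoding="{encoding.upper()}"?>\n')
        f.write("<channel>\n")
        for item in items:
            # Reserved characters still need escaping in any encoding.
            f.write(f"    <item>{escape(item)}</item>\n")
        f.write("</channel>\n")


# Write once per encoding and parse each file back as a check.
with tempfile.TemporaryDirectory() as tmp:
    for enc in ("utf-8", "iso-8859-1"):
        path = os.path.join(tmp, "catalogue.xml")
        write_xml(path, ["content with special character \u00e9"], enc)
        print(enc, "->", ET.parse(path).getroot()[0].text)
```

Both files parse, because each one tells the truth about its own bytes. The ISO-8859-1 case relies on the parser supporting that encoding, which is the drawback of option 2 mentioned above.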