I have an xml file.
<?xml version="1.0" encoding="UTF-8"?>
<channel>
<item>content with special character é</item>
</channel>
Assume that the above is the xml file, except with content from a product catalogue, with a lot more tags and content. This is created using the following process:
This gives me errors when I try to open the file in Firefox (my way of testing the parsing of the XML file). It tells me that I have some special characters that need escaping ("XML not well-formed" or something like that). So I put CDATA sections inside these XML tags, which should clear this up, right? It doesn't. It keeps stumbling over special characters, not just the ones that are reserved for XML (&, <, >, ...).
Here's when I started losing it. After some trying and testing with smaller XML files created manually (not through ColdFusion), I got it to work, just by dropping the CDATA sections and inserting the above code. Firefox parses the above code just fine. So after some thinking, I copied the entire contents of the faulty file, the original one, to a brand new manually created XML file (.txt --> renamed to .xml) and voila, no more errors.
Can somebody please explain to me how, in this case, 2 separate files with the exact same content, copied from the first to the second, get parsed differently? The first one shows multiple errors on special characters, while the second one has no problem with these at all. Please, someone, before I go berserk at my desk here.. >_>
Edit 1: When I say special characters, I specifically mean non-ASCII UTF-8 characters. I'm not talking about the characters reserved for XML (&, <, >, ...); I already escape these.
There are no special characters in the example you give, just normal ones like c, é, (I suppose space is a bit special), etc.
I would guess from what you describe that you are using the incorrect encoding. You're saying it's UTF-8, but is it really?
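One way to test that guess is to look at the raw bytes rather than at what an editor displays. The following is a minimal Python sketch, not part of the original setup: the helper name `check_utf8` is made up for illustration, and the two byte strings stand in for the generated file and the hand-made copy.

```python
def check_utf8(data: bytes) -> str:
    """Return 'ok' if data decodes as UTF-8, else describe the first bad byte."""
    try:
        data.decode("utf-8")
        return "ok"
    except UnicodeDecodeError as err:
        return f"invalid byte 0x{data[err.start]:02x} at offset {err.start}"


# The same visible text, saved in two different encodings.
as_utf8 = "<item>é</item>".encode("utf-8")      # é is stored as the bytes C3 A9
as_latin1 = "<item>é</item>".encode("latin-1")  # é is stored as the single byte E9

print(check_utf8(as_utf8))    # ok
print(check_utf8(as_latin1))  # invalid byte 0xe9 at offset 6
```

To run the check against real files, read them in binary mode, for example `check_utf8(open("catalogue.xml", "rb").read())`, where `catalogue.xml` is a placeholder name. Two files can look identical on screen and still differ at the byte level if the text was re-encoded when it was pasted into the new file and saved.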
If this is the problem, you've three solutions:
<?xml version="1.0" encoding="UTF-8"?>
<channel>
<item>content with special character é</item>
</channel>
Number one is the best to go for, but failing that there are pros and cons with the other two. In favour of number 2, it won't escape as many characters that don't really need to be escaped. In favour of number 3, only UTF-8 and UTF-16 have to be accepted by an XML parser, and faking it this way will work with any character set that's the same as UTF-8 for the range U+0000 to U+007F, which is a lot of them.
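Assuming the last option means writing every character above U+007F as a numeric character reference, so that the file contains only ASCII bytes, a rough Python sketch could look like this. The function name `escape_non_ascii` is hypothetical; the `xmlcharrefreplace` error handler is part of Python's standard codecs.

```python
def escape_non_ascii(text: str) -> bytes:
    """Encode text as ASCII, replacing anything above U+007F with &#NNN;."""
    # Reserved XML characters (&, <, >) are NOT handled here; they must
    # already be escaped in the text content before this step.
    return text.encode("ascii", errors="xmlcharrefreplace")


doc = "<item>content with special character é</item>"
print(escape_non_ascii(doc))
# b'<item>content with special character &#233;</item>'
```

Because the output is pure ASCII, it reads the same under UTF-8 and under any other encoding that matches UTF-8 in the range U+0000 to U+007F, which is why the declared encoding stops mattering for this approach.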