I have an application that gets a XML back from an API, the XML is saved in a column in a MS SQL database. This XML is used to render a report.
My problem is that MS SQL Server (2014 - 2022) seems to have problems with these special chars in XML tags. To test this simply use this:
DECLARE @xml as XML
--SET @xml = '<?xml version="1.0" encoding="UTF-8"?><test>this doesnt work É ó</test>'
SET @xml = '<?xml version="1.0" encoding="UTF-8"?><test>this works</test>'
select @xml.value('(/test)[1]','nvarchar(16)') as Test
FROM @xml.nodes('//*[text()]') AS x(y)
The first SET doesnt work - invalid xml symbol. The second SET works.
The normal .net XML stuff has no problems with this characters neither does the API which gives me this back.
Currently I string.replace
these characters in .net but there are always more new ones coming. When there is a new invalid char it breaks the process and I have to add it into the string replace logic, which means recompiling the application.
I can't whitelist a-zA-Z because I need to replace É -> E and Ð -> D and so on.
Even CDATA-encasing doesn't help.
The problem here is that you are using a UTF-8 value, but you are not in a collation that is using UTF-8. As a result you get an error as when the value is read it isn't valid any more. Presumably you are using a Latin collation, which means your varchar
value is read using ANSI-1252 for the code page; this means that the UTF-8 value can't be read properly and is causing an error on characters like É
and ó
(as they have different values in the ANSI-1252 vs UTF-8).
The obvious answer would be that if you need to work with UTF-8 values then work in a UTF-8 environment. That means creating a UTF-8 database though; this could be a problem if you have an existing application and also as you are using 2014-2017 as well can be ruled out, as they don't have UTF-8 support.
Another method would be to remove the declaration of the UTF-8, then your value will be read using the code page of the database:
DECLARE @xml as XML;
SET @xml = REPLACE('<?xml version="1.0" encoding="UTF-8"?><test>this doesnt work É ó</test>',' encoding="UTF-8"','');
select @xml.value('(/test/text())[1]','nvarchar(30)') as Test;
This, however, will be an issue if you have any characters that appear outside of the ANSI-1252 code page and presumably you do, as why use UTF-8 otherwise?
Finally, another method could be to replace the UTF-8 with UTF-16. Note that you will need to also ensure that your literal string is then an nvarchar
too, as otherwise you'll get an error:
DECLARE @xml as XML;
SET @xml = REPLACE(N'<?xml version="1.0" encoding="UTF-8"?><test>this doesnt work É ó</test>','"UTF-8"','"UTF-16"');
select @xml.value('(/test/text())[1]','nvarchar(30)') as Test;