Search code examples
sqlsql-serverxml

MS SQL Server and XML with special characters like É, Ó, ¿ º and °


I have an application that gets a XML back from an API, the XML is saved in a column in a MS SQL database. This XML is used to render a report.

My problem is that MS SQL Server (2014 - 2022) seems to have problems with these special chars in XML tags. To test this simply use this:

DECLARE @xml as XML
--SET @xml = '<?xml version="1.0" encoding="UTF-8"?><test>this doesnt work É ó</test>'
SET @xml = '<?xml version="1.0" encoding="UTF-8"?><test>this works</test>'
select  @xml.value('(/test)[1]','nvarchar(16)') as Test
FROM @xml.nodes('//*[text()]') AS x(y)

The first SET doesnt work - invalid xml symbol. The second SET works.

The normal .net XML stuff has no problems with this characters neither does the API which gives me this back.

Currently I string.replace these characters in .net but there are always more new ones coming. When there is a new invalid char it breaks the process and I have to add it into the string replace logic, which means recompiling the application.

I can't whitelist a-zA-Z because I need to replace É -> E and Ð -> D and so on.

Even CDATA-encasing doesn't help.


Solution

  • The problem here is that you are using a UTF-8 value, but you are not in a collation that is using UTF-8. As a result you get an error as when the value is read it isn't valid any more. Presumably you are using a Latin collation, which means your varchar value is read using ANSI-1252 for the code page; this means that the UTF-8 value can't be read properly and is causing an error on characters like É and ó (as they have different values in the ANSI-1252 vs UTF-8).

    The obvious answer would be that if you need to work with UTF-8 values then work in a UTF-8 environment. That means creating a UTF-8 database though; this could be a problem if you have an existing application and also as you are using 2014-2017 as well can be ruled out, as they don't have UTF-8 support.

    Another method would be to remove the declaration of the UTF-8, then your value will be read using the code page of the database:

    DECLARE @xml as XML;
    SET @xml = REPLACE('<?xml version="1.0" encoding="UTF-8"?><test>this doesnt work É ó</test>',' encoding="UTF-8"','');
    select @xml.value('(/test/text())[1]','nvarchar(30)') as Test;
    

    This, however, will be an issue if you have any characters that appear outside of the ANSI-1252 code page and presumably you do, as why use UTF-8 otherwise?

    Finally, another method could be to replace the UTF-8 with UTF-16. Note that you will need to also ensure that your literal string is then an nvarchar too, as otherwise you'll get an error:

    DECLARE @xml as XML;
    SET @xml = REPLACE(N'<?xml version="1.0" encoding="UTF-8"?><test>this doesnt work É ó</test>','"UTF-8"','"UTF-16"');
    select @xml.value('(/test/text())[1]','nvarchar(30)') as Test;