Search code examples
sql-serverxmlunicodecharacter-encodingtype-conversion

Converting accented characters in varchar() to XML causing "illegal XML character"


I have table written to by an application. The field is varchar(max). The data looks like xml.

DECLARE @poit VARCHAR(100)
SET @poit = '<?xml version="1.0" encoding="utf-8"?><test>VÍA</test>'
SELECT CONVERT(XML,@poit)

But (seemingly because of the UTF8; removing it works), I get this error:

XML parsing: line 1, character 46, illegal xml character

Is there a way to cleanly convert it?

I found this thread, which talks about varchar not supporting "non-ASCII characters", though obviously the I is non-unicode. Yes, I can do this:

SELECT CONVERT(XML, REPLACE(@poit, 'encoding="utf-8"', ''))

But is that the best way?

Why does casting a UTF-8 VARCHAR column to XML require converting to NVARCHAR and encoding change?


Solution

  • I would try changing the datatype of your @poit variable from VARCHAR(100) to NVARCHAR(100). Then replace the utf-8 encoding with utf-16 so your code would look something like:

        DECLARE @poit NVARCHAR(100)
        SET @poit = '<?xml version="1.0" encoding="utf-8"?><test>VÍA</test>'
        SELECT CONVERT(XML,REPLACE(@poit, 'utf-8', 'utf-16'))
    

    As long as you're not calling the conversion with the replace in it in a SELECT that returns oodles of results, the performance should be just fine and it will get the job done.

    Reference: http://xml.silmaril.ie/characters.html <- scroll down and you'll see some info as to the difference between utf-8 & utf-16. Hope this helps!