MSXML.DOMDocument.4.0 loadXML with Chinese Unicode characters


I'm trying to use the MSXML loadXML method in ASP to load an XML string that may contain Chinese Unicode characters such as

𠮢 (U+20BA2), 4 bytes

and the XML string looks like

<City>City</City><Name>𠮢</Name>

In my code I can see that the XML string arrives intact, but loadXML fails with an error message like

Invalid unicode characters, &#55362;&#57250

Can someone please tell me what I can do to resolve this issue?

Thanks,

Edited

The code looks like this

    Set objDoc = CreateObject("MSXML2.DOMDocument")
    objDoc.async = false
    objDoc.setProperty "SelectionLanguage", "XPath"
    objDoc.validateOnParse = false
    objDoc.loadXML(strXml)

Solution

  • I suggest posting the exact code, XML source and error message you are getting. I cannot reproduce an error by parsing <element>𠮢</element> in MSXML 4.0 SP3; this works fine.

    I do get a parseError with the reason "Invalid unicode character" when I try to parse <element>&#55362;&#57250;</element>, because that is not well-formed XML. If you have this in your markup, you need to fix the serialiser that produced it: neither MSXML nor any other standards-compliant XML parser will load it.

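    If 𠮢 is turned into a character reference, it must be &#134050; (or &#x20BA2;). The code units 55362 (0xD842) and 57250 (0xDFA2) are 'surrogates', reserved in UTF-16 for encoding characters outside the Basic Multilingual Plane (the 'astral' planes). A surrogate is not a character in its own right, so it can't be written as a character reference in an XML document.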
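
    To make the arithmetic concrete, here is a minimal sketch of how a surrogate pair maps back to a single code point, and how a string containing such invalid references could be repaired before parsing. It is written in Python rather than the question's VBScript purely to illustrate the calculation. The function names are hypothetical, it handles decimal references only, and it is a stopgap: the proper fix is still to correct the serialiser.

    ```python
    import re

    # Two adjacent decimal character references, e.g. "&#55362;&#57250;".
    _PAIR = re.compile(r"&#(\d+);&#(\d+);")


    def combine_surrogates(high, low):
        """Return the code point encoded by a UTF-16 surrogate pair."""
        if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
            raise ValueError("not a valid surrogate pair")
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


    def fix_surrogate_references(xml_text):
        """Rewrite each surrogate-pair reference as one code point reference."""
        out, pos = [], 0
        while True:
            m = _PAIR.search(xml_text, pos)
            if m is None:
                break
            high, low = int(m.group(1)), int(m.group(2))
            try:
                ref = "&#{};".format(combine_surrogates(high, low))
            except ValueError:
                # Not a pair: keep the first reference and rescan from the
                # second one, which may itself start a real pair.
                skip_to = m.end(1) + 1  # just past the first reference's ';'
                out.append(xml_text[pos:skip_to])
                pos = skip_to
                continue
            out.append(xml_text[pos:m.start()])
            out.append(ref)
            pos = m.end()
        out.append(xml_text[pos:])
        return "".join(out)


    print(fix_surrogate_references("<Name>&#55362;&#57250;</Name>"))
    # prints <Name>&#134050;</Name>
    ```

    For the pair in the question, 0x10000 + ((0xD842 - 0xD800) << 10) + (0xDFA2 - 0xDC00) = 0x20BA2, which is 134050 in decimal.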