c# xmlreader

How to handle invalid XML response with utf-16 header over a utf-8 stream


I'm getting the following error when trying to read some XML.

Exception has occurred: CLR/System.Xml.XmlException
Exception thrown: 'System.Xml.XmlException' in System.Private.Xml.dll: 'There is no Unicode byte order mark. Cannot switch to Unicode.'

I've traced this to the API serving the content as UTF-8 while the XML declaration claims utf-16:

<?xml version="1.0" encoding="utf-16"?>

I've confirmed this in tests with static files, either by deleting the encoding attribute or by saving the file as UTF-16. I have also confirmed that the incoming response is UTF-8 by looking at the response's Content.Headers.ContentType.
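
For reference, the charset check can be sketched as below. This is only an illustration: the class and method names (ResponseCharset, FromContent) are made up for this post, and falling back to UTF-8 when the header is missing or unrecognised is my assumption, not something the API documents.

```csharp
using System;
using System.Net.Http;
using System.Text;

static class ResponseCharset
{
    // Hypothetical helper: returns the encoding named in the Content-Type
    // header, or UTF-8 when the charset is absent or unknown (assumed fallback).
    public static Encoding FromContent(HttpContent content)
    {
        // Some servers quote the charset value, so strip any quotes.
        var charset = content.Headers.ContentType?.CharSet?.Trim('"');
        if (string.IsNullOrEmpty(charset))
        {
            return Encoding.UTF8;
        }

        try
        {
            return Encoding.GetEncoding(charset);
        }
        catch (ArgumentException)
        {
            // Unknown encoding name: fall back rather than fail.
            return Encoding.UTF8;
        }
    }
}
```

With a real response this would be called as ResponseCharset.FromContent(response.Content), and the result compared against what the XML declaration says.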

Unfortunately I don't maintain the API and don't think that this will be getting fixed any time soon.

Is there a way to make a System.Xml.XmlReader ignore the encoding in the XML declaration at the start of the stream? It would be nice if there were a flag to simply ignore the declared encoding when the API can't be bothered to make it accurate.

Perhaps the XML content can be corrected with some kind of replacement step before the final parsing?

I could always re-encode the content, but that seems a little mad.

var mockBytes = System.Text.Encoding.UTF8.GetBytes("<?xml version=\"1.0\" encoding=\"utf-16\"?>");
var mockStream = new MemoryStream(mockBytes);

XmlReaderSettings settings = new XmlReaderSettings();
settings.Async = true;
using (var reader = XmlReader.Create(mockStream, settings))
{
    if (reader.ReadToFollowing("Message") && await reader.ReadAsync())
    {
        while (await reader.MoveToContentAsync() == XmlNodeType.Element)
        { 
          ...
        }
    }
}

Solution

  • Thank you for the comments. Using them, I have been able to confirm that simply instantiating a StreamReader and passing it to XmlReader.Create is all that is required. The StreamReader has already decoded the bytes into characters, so the XmlReader no longer acts on the encoding attribute in the XML declaration.

    var mockBytes = System.Text.Encoding.UTF8.GetBytes("<?xml version=\"1.0\" encoding=\"utf-16\"?>");
    var mockStream = new MemoryStream(mockBytes);
    
    var sr = new StreamReader(mockStream);
    
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.Async = true;
    using (var reader = XmlReader.Create(sr, settings))
    {
        if (reader.ReadToFollowing("Message") && await reader.ReadAsync())
        {
            while (await reader.MoveToContentAsync() == XmlNodeType.Element)
            { 
              ...
            }
        }
    }
    

    This is the simplest solution I can imagine other than a flag on the XmlReaderSettings that doesn't seem to exist.

    Furthermore, as @Jereon says, skipping to particular characters or line endings would be very brittle and would fall over if anything else changed at the API. To do it properly you would have to parse more carefully, perhaps pushing everything between <? and ?> onto a stack, which is not easy and, fortunately, not necessary.
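
To make the difference between the two XmlReader.Create overloads concrete, here is a minimal synchronous sketch. The class name, method names, and sample payload with its Message root element are hypothetical stand-ins rather than the real API response. I have also passed Encoding.UTF8 to the StreamReader explicitly, which is my own choice and not something the code above needed.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class DeclaredEncodingDemo
{
    // UTF-8 bytes whose XML declaration wrongly claims utf-16
    // (sample payload for illustration only).
    public static byte[] SampleBytes() =>
        Encoding.UTF8.GetBytes(
            "<?xml version=\"1.0\" encoding=\"utf-16\"?><Message>hi</Message>");

    // Raw Stream: XmlReader decodes the bytes itself and acts on the
    // declared encoding, so this path is expected to fail.
    public static string ReadRootFromStream(byte[] bytes)
    {
        using var stream = new MemoryStream(bytes);
        using var reader = XmlReader.Create(stream);
        reader.MoveToContent();
        return reader.LocalName;
    }

    // TextReader: StreamReader decodes the bytes as UTF-8 first, so
    // XmlReader only ever sees characters, not bytes.
    public static string ReadRootFromTextReader(byte[] bytes)
    {
        using var stream = new MemoryStream(bytes);
        using var text = new StreamReader(stream, Encoding.UTF8);
        using var reader = XmlReader.Create(text);
        reader.MoveToContent();
        return reader.LocalName;
    }
}
```

Calling ReadRootFromStream on the sample bytes should raise the same XmlException as in the question, while ReadRootFromTextReader should reach the root element. If the API ever starts sending a different charset, the encoding passed to StreamReader would need to follow the Content-Type header rather than being hard-coded.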