C# MemoryStream Encoding vs. Encoding.GetChars()


I am trying to copy a byte stream from a database, decode it into a string and finally display it on a web page. However, I am seeing different results depending on how I decode the content (note: I am using the "Western European" encoding, code page 1252, which has a Latin character set and does not support Chinese characters):

var encoding = Encoding.GetEncoding(1252 /*Western European*/);
using (var fileStream = new StreamReader(new MemoryStream(content), encoding))
{
    var str = fileStream.ReadToEnd();
}

Vs.

var encoding = Encoding.GetEncoding(1252 /*Western European*/);
var str = new string(encoding.GetChars(content));

If the content contains Chinese characters, then the first block of code will produce a string like "D$教学而设计的", which is incorrect because the encoding shouldn't support those characters, while the second block will produce a string of Latin characters beginning "D$æ•™", which is correct because those characters are all in the Western European character set.

What is the explanation for this difference in behavior?


Solution

  • By default, a StreamReader looks for a byte order mark (BOM) at the start of the stream and takes its encoding from it, even if you pass a different encoding to the constructor.

    It sees the UTF-8 BOM in your data and correctly decodes the content as UTF-8.

    To prevent this behavior, pass false as the third argument (detectEncodingFromByteOrderMarks):

    var fileStream = new StreamReader(new MemoryStream(content), encoding, false);
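
The difference can be reproduced with a small self-contained program. This is only a sketch under a few assumptions: the sample bytes (a UTF-8 BOM followed by "D$教") stand in for the database content, the names BomDemo, SampleContent, WesternEuropean and ReadWithStreamReader are made up for illustration, and the code targets .NET Core / .NET 5+, where code page 1252 has to be made available through CodePagesEncodingProvider first (on .NET Framework that registration should not be needed).

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper class, not part of the original question or answer.
static class BomDemo
{
    // Assumed sample data: UTF-8 BOM (EF BB BF) followed by the UTF-8 bytes of "D$教".
    public static byte[] SampleContent()
    {
        var bom = new byte[] { 0xEF, 0xBB, 0xBF };
        return bom.Concat(Encoding.UTF8.GetBytes("D$\u6559")).ToArray();
    }

    // Assumption: running on .NET Core / .NET 5+, where code page 1252
    // is only available after registering the code pages provider.
    public static Encoding WesternEuropean()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // detectBom is passed through as detectEncodingFromByteOrderMarks.
    public static string ReadWithStreamReader(byte[] content, Encoding encoding, bool detectBom)
    {
        using (var reader = new StreamReader(new MemoryStream(content), encoding, detectBom))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        var encoding = WesternEuropean();
        var content = SampleContent();

        // BOM detection on: the reader switches to UTF-8 and consumes the BOM.
        var detected = ReadWithStreamReader(content, encoding, true);

        // BOM detection off: every byte is decoded as code page 1252.
        var forced = ReadWithStreamReader(content, encoding, false);

        // Direct decoding never looks for a BOM.
        var direct = new string(encoding.GetChars(content));

        Console.WriteLine(detected == "D$\u6559"); // True
        Console.WriteLine(forced == direct);       // True
        Console.WriteLine(forced);                 // BOM bytes show up as three Latin characters
    }
}
```

With detection turned off, the three BOM bytes are decoded like any other bytes, so the resulting string starts with "ï»¿". If that prefix is unwanted, it has to be stripped separately.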