c# encoding utf-8 utf-16 utf

Need help understanding UTF encodings


Hello, I have noticed that when I save a text file using UTF-8 encoding (no BOM), I am able to read it perfectly using the UTF-16 encoding in C#. This got me a little confused, because UTF-8 only uses 8 bits per character, right? And UTF-16 takes, well, 16 bits for each character.

Now imagine that I have the string "ab" written in this file as UTF-8. Then there is one byte for the letter "a" and another one for the "b".

OK, but how is it possible to read this UTF-8 file using the UTF-16 charset? The way I see it, while reading the file, the two bytes of "ab" would be mistaken for a single character containing both bytes, because UTF-16 needs those 2 bytes for one character.

This is how I read it (t.txt is encoded as UTF-8):

using(StreamReader sr = new StreamReader(File.OpenRead("t.txt"), Encoding.GetEncoding("utf-16")))
{
    Console.Write(sr.ReadToEnd());
    Console.ReadKey();
}
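
For comparison, here is a minimal sketch of the byte-level situation described above, decoding the raw bytes directly instead of going through StreamReader. The names Utf8VersusUtf16Demo and DecodeUtf8BytesAsUtf16 are made up for illustration; Encoding.Unicode is .NET's little-endian UTF-16, the same encoding that Encoding.GetEncoding("utf-16") returns.

```csharp
using System;
using System.Text;

public static class Utf8VersusUtf16Demo
{
    // Hypothetical helper: encode text as UTF-8 (no BOM), then decode
    // the very same bytes as little-endian UTF-16.
    public static string DecodeUtf8BytesAsUtf16(string text)
    {
        byte[] utf8Bytes = new UTF8Encoding(false).GetBytes(text);
        return Encoding.Unicode.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // "ab" in UTF-8 is one byte per letter.
        byte[] utf8Bytes = new UTF8Encoding(false).GetBytes("ab");
        Console.WriteLine(BitConverter.ToString(utf8Bytes)); // 61-62

        // Read as UTF-16 LE, the two bytes form a single 16-bit code unit.
        string misread = DecodeUtf8BytesAsUtf16("ab");
        Console.WriteLine(misread.Length);                   // 1
        Console.WriteLine(((int)misread[0]).ToString("X4")); // 6261
    }
}
```

Decoded this way, the two bytes really do collapse into one character (U+6261), so if the file reads back as "ab", something other than plain UTF-16 decoding must be going on.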

Solution

  • Check out http://www.joelonsoftware.com/articles/Unicode.html; it should answer most of your Unicode questions.
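
One thing that may be worth ruling out: the StreamReader(Stream, Encoding) constructor looks for a byte order mark by default and switches encoding if it finds one. A file that does carry a UTF-8 BOM would therefore be read as UTF-8 even though UTF-16 was requested. The sketch below uses in-memory bytes instead of t.txt, and BomDetectionDemo and ReadAsUtf16 are hypothetical names. Whether this applies to your file is an assumption to check, not a diagnosis.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectionDemo
{
    // Hypothetical helper: read the given bytes through a StreamReader that
    // is told to use UTF-16, with BOM detection switched on or off.
    public static string ReadAsUtf16(byte[] fileBytes, bool detectBom)
    {
        using (var reader = new StreamReader(
            new MemoryStream(fileBytes), Encoding.Unicode, detectBom))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x61, 0x62 }; // UTF-8 BOM + "ab"
        byte[] withoutBom = { 0x61, 0x62 };                // just "ab"

        // BOM present and detection on: the reader switches to UTF-8.
        Console.WriteLine(ReadAsUtf16(withBom, true) == "ab");     // True

        // No BOM: the bytes are decoded as UTF-16, giving one character.
        Console.WriteLine(ReadAsUtf16(withoutBom, true).Length);   // 1

        // BOM present but detection off: decoded as UTF-16, so not "ab".
        Console.WriteLine(ReadAsUtf16(withBom, false) == "ab");    // False
    }
}
```

To test this against the real file, pass false as a third argument to the StreamReader constructor in the question and see whether the output changes.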