Tags: c#, html, .net, ascii, utf-16

Is UTF-16 a superset of ASCII? If yes, why is UTF-16 incompatible with ASCII according to the HTML Standard?


According to the Wikipedia article on UTF-16, "...[UTF-16] is also the only web-encoding incompatible with ASCII" (at the end of the lead section). This statement cites the HTML Standard. Is it wrong?

I'm mainly a C# / .NET dev, and both .NET and .NET Core use UTF-16 internally to represent strings. I'm pretty certain that UTF-16 is a superset of ASCII, as I can easily write code that displays all ASCII characters:

public static void Main()
{
    // Cast each ASCII code point (0-127) to a UTF-16 char and print it.
    for (byte currentAsciiCharacter = 0; currentAsciiCharacter < 128; currentAsciiCharacter++)
    {
        Console.WriteLine($"ASCII character {currentAsciiCharacter}: \"{(char) currentAsciiCharacter}\"");
    }
}

Sure, the control characters will mess up the console output, but I think my point is clear: the lower 7 bits of a 16-bit char hold the corresponding ASCII code point, while the upper 9 bits are zero. Thus UTF-16 should be a superset of ASCII in .NET.
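To make that concrete with a string (my own illustration, assuming the usual using System; directive as in the snippet above), the UTF-16 code unit of every ASCII character carries exactly the same numeric value as the byte that ASCII assigns to it:

public static void Main()
{
    string text = "Hello, ASCII!";
    byte[] asciiBytes = System.Text.Encoding.ASCII.GetBytes(text);

    for (int i = 0; i < text.Length; i++)
    {
        // The UTF-16 code unit (char) and the ASCII byte hold the same number,
        // e.g. 'H' is 0x0048 in UTF-16 and 0x48 in ASCII.
        Console.WriteLine($"'{text[i]}': UTF-16 code unit 0x{((int) text[i]):X4}, ASCII byte 0x{asciiBytes[i]:X2}");
    }
}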

I tried to find out why the HTML Standard says that UTF-16 is incompatible with ASCII, but it seems like they simply define it that way:

An ASCII-compatible encoding is any encoding that is not a UTF-16 encoding.

I couldn't find any explanation in the spec of why UTF-16 is not compatible.

My detailed questions are:

  1. Is UTF-16 actually compatible with ASCII? Or did I miss something here?
  2. If it is compatible, why does the HTML Standard say it's not compatible? Maybe because of byte ordering?

Solution

  • ASCII is a 7-bit encoding, stored with one byte per character. UTF-16 uses 16-bit (two-byte) code units, so even characters from the ASCII range are serialized as two bytes, which makes the byte streams incompatible right away. UTF-8 uses single-byte code units and encodes the English alphabet (the whole ASCII range, in fact) with exactly the same bytes as ASCII. In other words, UTF-8 is designed to be backward compatible with ASCII.
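To see the difference at the byte level, here is a minimal sketch using the standard System.Text.Encoding classes (Encoding.Unicode is .NET's name for UTF-16LE); the expected bytes are noted in the comments:

public static void Main()
{
    string text = "A"; // ASCII code point 0x41

    // ASCII and UTF-8 produce the identical single byte: 41.
    Console.WriteLine("ASCII:    " + BitConverter.ToString(System.Text.Encoding.ASCII.GetBytes(text)));
    Console.WriteLine("UTF-8:    " + BitConverter.ToString(System.Text.Encoding.UTF8.GetBytes(text)));

    // UTF-16 produces two bytes per code unit: 41-00 (little endian) or 00-41 (big endian).
    Console.WriteLine("UTF-16LE: " + BitConverter.ToString(System.Text.Encoding.Unicode.GetBytes(text)));
    Console.WriteLine("UTF-16BE: " + BitConverter.ToString(System.Text.Encoding.BigEndianUnicode.GetBytes(text)));
}

A parser that expects ASCII bytes would see an extra NUL byte for every character, in a position that depends on the byte order. That is why the HTML Standard treats UTF-16 as the one web encoding that is not ASCII-compatible, even though it covers all ASCII code points.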