
How to Determine "Lowest" Encoding Possible?


Scenario

You have lots of XML files stored as UTF-16 in a database or on a server where space is not an issue. You need to deliver the large majority of these files to other systems as XML files, and there it is critical to use as little space as possible.

Issue

In reality only about 10% of the files stored as UTF-16 actually need to be UTF-16; the rest can safely be stored as UTF-8. If the files that need UTF-16 stay UTF-16 and the rest become UTF-8, we can use about 40% less space on the file system.

We have tried compressing the data, and while that helps, we find that we get the same compression ratio with UTF-8 as with UTF-16, and UTF-8 compresses faster as well. So if as much of the data as possible is stored as UTF-8, we not only save space when it is stored uncompressed, we still save space when it is compressed, and we even save time on the compression itself.

Goal

To figure out when an XML file contains Unicode characters that require UTF-16, so we only use UTF-16 when we have to.

Some Details about XML File and Data

While we control the schema for the XML itself, we do not control what kind of strings can appear in the values, because the source is free to send us any Unicode data. However, this is rare, so we would like to avoid using UTF-16 every time just to support something that is only needed 10% of the time.

Development Environment

We are using C# with the .NET Framework 4.0.

EDIT: Solution

The solution is just to use UTF-8.

The question was based on my misunderstanding of UTF and I appreciate everyone helping set me straight. Thank you!
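For reference, a minimal sketch of writing an XML document out as UTF-8 in C#/.NET 4.0 (the `doc` and `path` names here are placeholders for your own document and target file):

```csharp
using System.Text;
using System.Xml;

class Utf8Save
{
    // Saves an XmlDocument as UTF-8; XmlWriter rewrites the XML declaration
    // (<?xml version="1.0" encoding="utf-8"?>) to match the chosen encoding.
    public static void SaveAsUtf8(XmlDocument doc, string path)
    {
        var settings = new XmlWriterSettings
        {
            // new UTF8Encoding(false) omits the 3-byte BOM; pass true
            // (or use Encoding.UTF8) if you want the BOM written.
            Encoding = new UTF8Encoding(false)
        };
        using (var writer = XmlWriter.Create(path, settings))
        {
            doc.Save(writer);
        }
    }
}
```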


Solution

  • Encode everything in UTF-8. UTF-8 can represent anything UTF-16 can, and for an XML document it is almost surely going to be smaller. The only case in which UTF-8 comes out larger than UTF-16 is a file largely composed of characters in the U+0800–U+FFFF range (CJK text, for example), which take 3 bytes in UTF-8 but only 2 in UTF-16. In the best case (pure ASCII, which includes every character you can type on a standard U.S. 104-key keyboard) a UTF-8 file is half the size of its UTF-16 equivalent.

    UTF-8 requires 2 bytes or less per character for every code point at or below U+07FF, and a single byte for any ASCII character (U+0000–U+007F); that means UTF-8 will be no larger than UTF-16 (and usually far smaller) for any document in a modern-day language using the Latin, Greek, Cyrillic, Hebrew or Arabic alphabets, including most of the common symbols used in algebra and the IPA. These scripts all sit in the low end of the Basic Multilingual Plane (which spans U+0000–U+FFFF), and between them they cover the official national languages of most countries outside of Asia.

    UTF-16, as a general rule, will give you a smaller file for documents written primarily in the Devanagari (Hindi), Japanese, Chinese, or Hangul (Korean) scripts, or in any ancient or "esoteric" script (Cherokee or Inuktitut, anyone?), since those code points take 3 bytes in UTF-8 but 2 in UTF-16; it MAY also be smaller for documents that heavily use specialized mathematical, scientific, engineering or game symbols. If the XML you're working with holds localization files for India, China and Japan, you MAY get a smaller file size with UTF-16, but you will have to make your program smart enough to know those localization files are encoded that way.
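You don't have to guess where that cutover falls: `Encoding.GetByteCount` tells you exactly how many bytes a string takes in each encoding, so you can measure per file. A sketch (the `SmallerEncodingFor` helper is hypothetical, not a framework API; `Encoding.Unicode` is UTF-16LE in .NET):

```csharp
using System;
using System.Text;

static class EncodingChooser
{
    // Hypothetical helper: returns whichever of UTF-8 / UTF-16 actually
    // produces fewer bytes for this text.
    public static Encoding SmallerEncodingFor(string xmlText)
    {
        int utf8  = Encoding.UTF8.GetByteCount(xmlText);
        int utf16 = Encoding.Unicode.GetByteCount(xmlText);
        return utf8 <= utf16 ? Encoding.UTF8 : Encoding.Unicode;
    }
}

class Demo
{
    static void Main()
    {
        // Pure-ASCII XML: 1 byte/char in UTF-8 vs 2 in UTF-16.
        string ascii = "<root>hello</root>";
        // Mostly-CJK XML (U+0800–U+FFFF): 3 bytes/char in UTF-8 vs 2 in UTF-16.
        string cjk = "<root>漢字漢字漢字漢字漢字漢字漢字漢字</root>";

        Console.WriteLine(EncodingChooser.SmallerEncodingFor(ascii).WebName); // utf-8
        Console.WriteLine(EncodingChooser.SmallerEncodingFor(cjk).WebName);   // utf-16
    }
}
```

For the 10%-UTF-16 mix described in the question, this kind of check would route most files to UTF-8 automatically while leaving heavily-CJK files as UTF-16.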