c# string size ascii

The actual length of a string


I have a string that consists of a mixture of Chinese characters and printable ASCII characters.

string str = "Test測試123";

When I use str.Length or str.ToCharArray(), both count each Chinese character as 1 character! That is not right, because every Chinese character is 2 bytes!

Even if I try Encoding.ASCII.GetBytes(str), it just gives me 63 for every one of the Chinese characters, and the resulting count is the same as with Length or ToCharArray()!

That is the wrong result for my purpose!

Is there any way to get the actual length of a string!?

In the example I just gave, that would be 11 instead of 9.


Solution

  • Length in the Unicode world is always fun... Which length do you need? For example:

    // Requires: using System.Text;
    string str = "🤣";
    
    // Length in UTF-16 code units
    int len = str.Length; // 2
    
    // Length in bytes when encoded as UTF-16, as .NET does internally
    int len2 = str.Length * 2; // 4
    
    // Length in bytes when encoded as UTF-8
    int len3 = Encoding.UTF8.GetByteCount(str); // 4
    
    // Length in Unicode code points
    int len4 = Encoding.UTF32.GetByteCount(str) / 4; // 1
    

    Note that there is a fifth length, the number of grapheme clusters, which is even more complex to calculate because some code points can "merge" together, and a sixth, the number of glyphs.
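
    For that fifth length, .NET has System.Globalization.StringInfo, which counts "text elements". A minimal sketch, assuming text elements are close enough to grapheme clusters for your purpose (older .NET Framework versions segment more simply than .NET 5 and later, so complex emoji sequences can give different counts there). The class and method names are made up for illustration:

    ```csharp
    using System;
    using System.Globalization;

    public static class TextElementDemo
    {
        // Counts text elements, roughly the user-perceived characters.
        public static int CountTextElements(string s)
        {
            return new StringInfo(s).LengthInTextElements;
        }

        public static void Main()
        {
            // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
            // two UTF-16 code units, but one visible character.
            string accented = "e\u0301";
            Console.WriteLine(accented.Length);                  // 2
            Console.WriteLine(CountTextElements(accented));      // 1

            // The string from the question has no combining sequences.
            Console.WriteLine(CountTextElements("Test測試123")); // 9
        }
    }
    ```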

    For your string, len is 9, len2 is 18, len3 (the length in bytes when encoded as UTF-8) is 13 (4 letters, 2 Chinese characters at 3 bytes each, and 3 digits), and len4 is 9.

    Nearly all Chinese characters in common use are in the Basic Multilingual Plane of the Unicode standard, so each one takes 1 UTF-16 code unit and is encoded as 3 bytes in UTF-8.
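
    None of the lengths above gives the 11 you asked for. If that number comes from a legacy double-byte convention (1 byte per ASCII character, 2 bytes per Chinese character), you can count it directly. A minimal sketch under that assumption: the helper name DoubleByteLength is hypothetical, and it simply treats every non-ASCII UTF-16 code unit as 2 bytes, which only approximates what a real code page such as Big5 or GBK would produce:

    ```csharp
    using System;
    using System.Linq;

    public static class LegacyLength
    {
        // Assumption: ASCII (U+0000..U+007F) counts as 1, everything else as 2.
        // A character outside the BMP is two code units and therefore counts as 4.
        public static int DoubleByteLength(string s)
        {
            return s.Sum(c => c <= 0x7F ? 1 : 2);
        }

        public static void Main()
        {
            // 4 letters + 2 Chinese characters * 2 + 3 digits
            Console.WriteLine(DoubleByteLength("Test測試123")); // 11
        }
    }
    ```

    If you need the exact byte count for one specific code page, Encoding.GetEncoding(...).GetByteCount(str) is the more precise route. On .NET Core and later the legacy code pages have to be registered first through CodePagesEncodingProvider.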

    An interesting reference: "What's the difference between a character, a code point, a glyph and a grapheme?"

    Ah... and please forget about Encoding.ASCII. Live as if it didn't exist. It probably isn't what you think it is. Even if you lived in the old MS-DOS world with its funny characters, that wasn't ASCII.
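
    The 63s in the question come from exactly that: Encoding.ASCII only knows code points 0 to 127 and, with its default settings, substitutes '?' (byte value 63) for everything else. A small sketch that shows the effect (the class and method names are only illustrative):

    ```csharp
    using System;
    using System.Text;

    public static class AsciiDemo
    {
        // Encodes with the default ASCII encoder, which replaces
        // every non-ASCII character with '?' (byte 63).
        public static byte[] ToAscii(string s)
        {
            return Encoding.ASCII.GetBytes(s);
        }

        public static void Main()
        {
            byte[] bytes = ToAscii("A測");
            Console.WriteLine(string.Join(",", bytes));          // 65,63
            Console.WriteLine(Encoding.ASCII.GetString(bytes));  // A?
        }
    }
    ```

    The original character is lost in the process, so the byte count you get back says nothing about how the string would be stored in any real encoding.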