I have a string that consists of a mixture of Chinese characters and printable ASCII characters:
string str = "Test測試123";
When I use str.Length or str.ToCharArray(), both count each Chinese character as 1 character! That is not what I expect, because every Chinese character is 2 bytes!
Even if I try Encoding.ASCII.GetBytes(str), it just gives me 63 for ALL the Chinese characters!!! And the resulting length is the same as with Length or ToCharArray(), which is the wrong result for my purpose!!!
Is there any way to get the actual length of a string!? For the example I just gave: 11 instead of 9!?
Length in the Unicode world is always fun... What Length do you need? For example:
string str = "🤣";
// Length in UTF-16 code units
int len = str.Length; // 2
// Length in bytes, if encoded in UTF16, as done by .NET
int len2 = str.Length * 2; // 4
// Length in bytes, if encoded in UTF8
int len3 = Encoding.UTF8.GetByteCount(str); // 4
// Length in Unicode code points
int len4 = Encoding.UTF32.GetByteCount(str) / 4; // 1
Note that there is a fifth length, the length in number of grapheme clusters, which is even more complex to calculate because some code points can "merge" together, and a sixth: the length in number of glyphs.
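As a rough illustration of that fifth length, here is a minimal sketch using StringInfo from System.Globalization. The sample string is my own choice, not something from the question. Note that what .NET calls a "text element" only matches Unicode extended grapheme clusters on newer runtimes (.NET 5 and later, as far as I know), so results for complex emoji sequences can differ on the old .NET Framework.

```csharp
using System;
using System.Globalization;

// "e" followed by U+0301 (combining acute accent):
// two code points that are displayed as one character.
string accented = "e\u0301";

// Length in UTF-16 code units
int codeUnits = accented.Length;

// Length in text elements (the closest built-in notion to grapheme clusters)
int textElements = new StringInfo(accented).LengthInTextElements;

Console.WriteLine($"{codeUnits} code units, {textElements} text element(s)");
// prints "2 code units, 1 text element(s)"
```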
Now, your string has len equal to 9, len2 equal to 18, len3 (so the length in bytes if converted to UTF8) equal to 13, and len4 equal to 9.
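If you want to check those numbers yourself, a small sketch along these lines should do it. The variable names are mine, and it assumes the source file is saved as UTF-8 so that the string literal survives compilation intact.

```csharp
using System;
using System.Text;

string str = "Test測試123";

// Length in UTF-16 code units
int utf16Units = str.Length;

// Length in bytes when encoded as UTF-16 (Encoding.Unicode is UTF-16 little endian)
int utf16Bytes = Encoding.Unicode.GetByteCount(str);

// Length in bytes when encoded as UTF-8: 7 ASCII characters * 1 + 2 Chinese characters * 3
int utf8Bytes = Encoding.UTF8.GetByteCount(str);

// Length in Unicode code points (UTF-32 always uses 4 bytes per code point)
int codePoints = Encoding.UTF32.GetByteCount(str) / 4;

Console.WriteLine($"{utf16Units} {utf16Bytes} {utf8Bytes} {codePoints}");
// prints "9 18 13 9"
```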
Nearly all commonly used Chinese characters are in the Basic Multilingual Plane of the Unicode standard, so each of them is a single UTF-16 code unit (2 bytes) and takes 3 bytes in UTF-8. The rarer characters outside that plane take 2 UTF-16 code units and 4 bytes in UTF-8.
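To make the difference concrete, here is a short sketch comparing a character from your string with a character outside the Basic Multilingual Plane. U+20000 is just an example I picked from the CJK Extension B block; it is not from the question.

```csharp
using System;
using System.Text;

// A character from the question's string, inside the Basic Multilingual Plane
string bmp = "測";

// U+20000, a CJK ideograph outside the Basic Multilingual Plane (illustrative choice)
string supplementary = "\U00020000";

int bmpUnits = bmp.Length;                                         // 1 UTF-16 code unit
int bmpUtf8 = Encoding.UTF8.GetByteCount(bmp);                     // 3 bytes in UTF-8

int suppUnits = supplementary.Length;                              // 2 UTF-16 code units (a surrogate pair)
int suppUtf8 = Encoding.UTF8.GetByteCount(supplementary);          // 4 bytes in UTF-8
int suppPoints = Encoding.UTF32.GetByteCount(supplementary) / 4;   // still 1 code point

Console.WriteLine($"BMP: {bmpUnits} unit, {bmpUtf8} UTF-8 bytes");
Console.WriteLine($"Supplementary: {suppUnits} units, {suppUtf8} UTF-8 bytes, {suppPoints} code point");
```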
An interesting reference: What's the difference between a character, a code point, a glyph and a grapheme?
Ah... and please forget about Encoding.ASCII. Live like it doesn't exist. It probably isn't what you think it is. Even if you lived in the old MS DOS world with its funny characters, that wasn't ASCII.
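For completeness, this sketch shows where your 63s come from: with its default settings, Encoding.ASCII replaces every character it cannot represent with '?', whose code is 63, so the original characters are lost and the byte count tells you nothing about them. The variable names are mine.

```csharp
using System;
using System.Text;

string str = "Test測試123";

// Each Chinese character is replaced by a single '?' (byte value 63)
byte[] ascii = Encoding.ASCII.GetBytes(str);

Console.WriteLine(string.Join(" ", ascii));
// prints "84 101 115 116 63 63 49 50 51"

// Decoding the bytes again does not bring the original characters back
string roundTrip = Encoding.ASCII.GetString(ascii);
Console.WriteLine(roundTrip);
// prints "Test??123"
```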