Tags: c++, string, unicode, string-length

Loop through Unicode string as character


With the following string, the size is incorrectly output. Why is this, and how can I fix it?

#include <iostream>
#include <string>
int main() {
    std::string str = " ██████";
    std::cout << str.size(); // outputs 19 rather than 7
}

I'm trying to loop through str character by character so I can read it into a vector<string>, which should have a size of 7, but I can't do this since the above code outputs 19.


Solution

  • TL;DR

    The size() and length() members of basic_string return the number of code units in the underlying encoding, not the number of visible characters. To get the expected number:

    • Use UTF-16 with the u prefix for very simple strings that contain no characters outside the BMP and no combining or joining characters
    • Use UTF-32 with the U prefix for very simple strings that contain no combining or joining characters
    • Normalize the string and count code points for arbitrary Unicode strings

    " ██████" is a space followed by a series of 6 U+2588 characters. Your compiler seems to be using UTF-8 for std::string. UTF-8 is a variable-length encoding and many letters are encoded using multiple bytes (because obviously you can't encode more than 256 characters with just one byte). In UTF-8 code points between U+0800 and U+FFFF are encoded by 3 bytes. Therefore the length of the the string in UTF-8 is 1 + 6*3 = 19 bytes.

    You can check with any online Unicode converter and see that the string is encoded as 20 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 in UTF-8, or loop through each byte of the string yourself, as in the sketch below.
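
    For example, a minimal sketch that dumps the bytes (assuming the source file is saved as UTF-8 so the literal is UTF-8-encoded):

    #include <iostream>
    #include <iomanip>
    #include <string>

    int main() {
        std::string str = " ██████";
        // Print each byte of the UTF-8 encoding in hex
        for (unsigned char c : str)
            std::cout << std::uppercase << std::hex << std::setw(2)
                      << std::setfill('0') << static_cast<int>(c) << ' ';
        std::cout << '\n' << std::dec << str.size() << " bytes\n"; // 19 bytes
    }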

    If you want the total number of visible characters in the string, it's a lot trickier, and churill's solution doesn't work. Read the example from Twitter below:

    If you use anything beyond the most basic letters, numbers, and punctuation the situation gets more confusing. While many people use multi-byte Kanji characters to exemplify these issues, Twitter has found that accented vowels cause the most confusion because English speakers simply expect them to work. Take the following example: the word “café”. It turns out there are two byte sequences that look exactly the same, but use a different number of bytes:

    café  0x63 0x61 0x66 0xC3 0xA9        Using the “é” character, called the “composed character”.
    café  0x63 0x61 0x66 0x65 0xCC 0x81   Using the combining diacritical, which overlaps the “e”
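
    To see the difference in plain std::string terms, here is a small illustrative sketch using the same byte values listed in the table above:

    #include <iostream>
    #include <string>

    int main() {
        // The same two byte sequences from the table, written as hex escapes
        std::string composed  = "caf\xC3\xA9";   // precomposed U+00E9
        std::string combining = "cafe\xCC\x81";  // 'e' + combining U+0301
        std::cout << (composed == combining) << '\n'; // 0: the byte sequences differ
        std::cout << composed.size() << '\n';         // 5
        std::cout << combining.size() << '\n';        // 6
    }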
    

    You need a Unicode library like ICU to normalize the string and count code points. Twitter, for example, uses Normalization Form C (NFC).
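
    A minimal sketch of this approach using ICU's Normalizer2 (assuming ICU is installed and the program is linked against the icuuc library):

    #include <iostream>
    #include <unicode/normalizer2.h>
    #include <unicode/unistr.h>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
        if (U_FAILURE(status)) return 1;

        // "cafe" + U+0301 combining acute: 5 code points before normalization
        icu::UnicodeString s = icu::UnicodeString::fromUTF8("cafe\xCC\x81");
        icu::UnicodeString normalized = nfc->normalize(s, status);
        if (U_FAILURE(status)) return 1;

        // countChar32() counts code points rather than UTF-16 code units
        std::cout << normalized.countChar32() << '\n'; // 4, matching what a reader sees
    }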

    Edit:

    Since you're only interested in box-drawing characters, which lie inside the BMP and involve no combining or joining characters, UTF-16 and UTF-32 will both work. Like std::string, std::wstring is a basic_string and has no mandated encoding. In most implementations it's either UTF-16 (Windows) or UTF-32 (*nix), so it may work, but relying on that is fragile. The better way is to use std::u16string (std::basic_string<char16_t>) and std::u32string (std::basic_string<char32_t>). They'll work regardless of the system and of the source file's encoding.

    std::wstring wstr     = L" ██████";
    std::u16string u16str = u" ██████";
    std::u32string u32str = U" ██████";
    std::cout << wstr.size();   // may work: the number of wchar_t units
    std::cout << u16str.size(); // always returns the number of UTF-16 code units
    std::cout << u32str.size(); // always returns the number of UTF-32 code units
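
    To get the vector<string> with 7 elements from the question, another option is to keep the UTF-8 std::string and split it at code point boundaries, since the lead byte of each UTF-8 sequence encodes the sequence length. A minimal sketch, assuming valid UTF-8 input and, as here, no combining characters:

    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        std::string str = " ██████";
        std::vector<std::string> chars;
        for (std::size_t i = 0; i < str.size();) {
            // The lead byte of a UTF-8 sequence encodes its length (1-4 bytes)
            unsigned char lead = static_cast<unsigned char>(str[i]);
            std::size_t len = 1;                     // 0xxxxxxx: ASCII
            if      ((lead & 0xE0) == 0xC0) len = 2; // 110xxxxx
            else if ((lead & 0xF0) == 0xE0) len = 3; // 1110xxxx
            else if ((lead & 0xF8) == 0xF0) len = 4; // 11110xxx
            chars.push_back(str.substr(i, len));
            i += len;
        }
        std::cout << chars.size() << '\n'; // 7
    }

    Each element of chars then holds the complete multi-byte encoding of one code point.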
    

    If you're interested in how to count characters for arbitrary Unicode strings, continue reading below.

    The “café” issue mentioned above raises the question of how you count the characters in the Tweet string “café”. To the human eye the length is clearly four characters. Depending on how the data is represented this could be either five or six UTF-8 bytes. Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation. Therefore, Twitter does count “café” as four characters, no matter which representation is sent.

    [...]

    Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text. This type of normalization favors the use of a fully combined character (0xC3 0xA9 from the café example) over the long-form version (0x65 0xCC 0x81). Twitter also counts the number of codepoints in the text rather than UTF-8 bytes. The 0xC3 0xA9 from the café example is one codepoint (U+00E9) that is encoded as two bytes in UTF-8, whereas 0x65 0xCC 0x81 is two codepoints encoded as three bytes

    Twitter - Counting characters
