
Store each character from a std::string into a std::string


How would you take each character of a std::string and store it in its own std::string?

I run into a problem when the std::string contains non-ASCII characters, such as "á". If I do:

std::string test = "márcos";

std::string char1 = std::string(1, test.at(0));
std::string char2 = std::string(1, test.at(1));
std::string char3 = std::string(1, test.at(2));
std::string char4 = std::string(1, test.at(3));

std::cout << "Result: " << char1 << " -- " << char2 << " -- " << char3  << " -- " << char4 << std::endl;

Output: Result: m -- � -- � -- r

As you can see, the desired result would be "m -- á -- r -- c", but that is not what I get, because the special character is stored as two bytes (two char elements) rather than one.

How can we solve this? Thanks :)


Solution

  • The number of bytes (between one and four) used to encode a codepoint in UTF-8 can be determined by looking at the high bits of the leading byte.

    bytes    codepoints             byte 1    byte 2    byte 3    byte 4
      1      U+0000  .. U+007F      0xxxxxxx        
      2      U+0080  .. U+07FF      110xxxxx  10xxxxxx        
      3      U+0800  .. U+FFFF      1110xxxx  10xxxxxx  10xxxxxx        
      4      U+10000 .. U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
    

    The following breaks a UTF-8 encoded std::string into its individual codepoints, using the lead byte of each one to decide how many bytes to take.

    #include <string>
    #include <iostream>
    
    // Length in bytes of the UTF-8 sequence whose lead byte is c,
    // or -1 if c cannot start a sequence.
    int bytelen(char c)
    {
        if(!(c & 0x80))         return 1;   // ascii char, NUL included ($)
        if((c & 0xE0) == 0xC0)  return 2;   // 2-byte codepoint (¢)
        if((c & 0xF0) == 0xE0)  return 3;   // 3-byte codepoint (€)
        if((c & 0xF8) == 0xF0)  return 4;   // 4-byte codepoint (𐍈)
    
        return -1;                          // continuation or invalid byte
    }
    
    int main()
    {
        std::string test = "$¢€𐍈";
        std::cout << "'" << test << "' length = " << test.length() << std::endl;
    
        for(std::string::size_type off = 0; off < test.length(); )
        {
            int len = bytelen(test[off]);
            if(len < 1) return 1;           // not a valid lead byte
    
            std::string chr = test.substr(off, len);
            std::cout << "'" << chr << "'" << std::endl;
            off += len;                     // jump to the next lead byte
        }
    
        return 0;
    }
    

    Output:

    '$¢€𐍈' length = 10
    '$'
    '¢'
    '€'
    '𐍈'