C++ tolower on special characters such as ü


I have trouble converting a string to lowercase with the tolower() function in C++. It works as expected with plain ASCII strings, but special characters are not converted correctly.

This is how I use the function:

string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    LowerCase += tolower(NotLowerCase[i]);
}

For example:

  1. Test -> test
  2. TeST2 -> test2
  3. Grüßen -> gr????en
  4. (§) -> ()

As you can see, examples 3 and 4 do not work as expected.

How can I fix this issue? I need to keep the special characters, but in lowercase.


Solution

  • The sample code below, from the tolower reference documentation, shows how to fix this: you have to use a locale other than the default "C" locale.

    #include <iostream>
    #include <cctype>
    #include <clocale>
    
    int main()
    {
        unsigned char c = '\xb4'; // the character Ž in ISO-8859-15
                                  // but ´ (acute accent) in ISO-8859-1 
    
        std::setlocale(LC_ALL, "en_US.iso88591");
        std::cout << std::hex << std::showbase;
        std::cout << "in iso8859-1, tolower('0xb4') gives "
                  << std::tolower(c) << '\n';
        std::setlocale(LC_ALL, "en_US.iso885915");
        std::cout << "in iso8859-15, tolower('0xb4') gives "
                  << std::tolower(c) << '\n';
    }
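
    To connect the locale example to the loop in the question, here is a minimal sketch of a byte-wise lowercase helper. The names to_lower_bytes and use_latin1_locale are hypothetical, and the locale name is platform-specific and may not be installed. The sketch also casts each char to unsigned char before calling std::tolower, because passing a negative char value is undefined behavior. It only helps when the string is stored in a single-byte encoding that matches the locale. If the source file is UTF-8, ü and ß occupy two bytes each and a byte-wise tolower cannot convert them, which would be consistent with the four question marks in the output above.

```cpp
#include <cctype>
#include <clocale>
#include <string>

// Lowercase a string one byte at a time using the current C locale.
// Only meaningful for single-byte encodings such as ISO-8859-1.
std::string to_lower_bytes(const std::string& input)
{
    std::string result;
    result.reserve(input.size());
    for (char ch : input) {
        // Cast to unsigned char first: passing a negative char value
        // to std::tolower is undefined behavior.
        result += static_cast<char>(
            std::tolower(static_cast<unsigned char>(ch)));
    }
    return result;
}

// Try to select a Latin-1 locale; returns false if it is not available.
// The locale name is an assumption and differs between platforms.
bool use_latin1_locale()
{
    return std::setlocale(LC_ALL, "en_US.iso88591") != nullptr;
}
```

    A caller would check the result of use_latin1_locale() once at startup and then call to_lower_bytes on Latin-1 encoded input. In the default "C" locale only A to Z are converted.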
    

    You might also change std::string to std::wstring, which is Unicode on many C++ implementations.

    // towlower is declared in <cwctype>
    wstring NotLowerCase = L"Grüßen";
    wstring LowerCase;
    for (auto&& ch : NotLowerCase) {
        LowerCase += towlower(ch);
    }
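
    As a self-contained version of the fragment above, the following sketch wraps the loop in a function. The name to_lower_wide is hypothetical. It writes non-ASCII characters as universal character names (\u00FC for ü, \u00DF for ß) so that the result does not depend on the encoding of the source file, and it converts explicitly between wchar_t and wint_t. Which non-ASCII letters std::towlower actually converts still depends on the current locale.

```cpp
#include <cwctype>
#include <string>

// Lowercase a wide string with std::towlower, which follows the
// current C locale (LC_CTYPE category).
std::wstring to_lower_wide(const std::wstring& input)
{
    std::wstring result;
    result.reserve(input.size());
    for (wchar_t ch : input) {
        // std::towlower takes and returns wint_t, so convert explicitly.
        result += static_cast<wchar_t>(
            std::towlower(static_cast<std::wint_t>(ch)));
    }
    return result;
}
```

    For the string in the question, to_lower_wide(L"Gr\u00FC\u00DFen") only has to change the leading G, since ü and ß are already lowercase.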
    

    Guidance from Microsoft is to "Normalize strings to uppercase", so you might use toupper or towupper instead.

    Keep in mind that a character-by-character transformation might not work well for some languages. For example, in German as written in Germany, converting Grüßen to upper case turns it into GRÜSSEN (although there is now a capital ẞ). There are numerous other "problems", such as combining characters; if you're doing real "production" work with strings, you really want a completely different approach.
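
    The ß case can be illustrated with a small sketch. The function name to_upper_expanding_eszett is hypothetical, and the single hard-coded rule is only an illustration, not a complete case-mapping implementation. It shows that one input character can become two output characters, which a loop that maps each character to exactly one character cannot express.

```cpp
#include <cwctype>
#include <string>

// Uppercase a wide string, expanding U+00DF (sharp s) to "SS".
// The single special case is illustrative only; full Unicode case
// mapping has many more rules of this kind.
std::wstring to_upper_expanding_eszett(const std::wstring& input)
{
    std::wstring result;
    for (wchar_t ch : input) {
        if (ch == L'\u00DF') {
            result += L"SS";  // one input character becomes two
        } else {
            // All other characters go through the locale-dependent
            // one-to-one mapping.
            result += static_cast<wchar_t>(
                std::towupper(static_cast<std::wint_t>(ch)));
        }
    }
    return result;
}
```

    With this sketch, the six-character input L"Stra\u00DFe" produces the seven-character result L"STRASSE". For production work, a dedicated Unicode library is usually the more robust choice.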

    Finally, C++ has more sophisticated support for managing locales; see <locale> for details.
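
    As a rough sketch of the <locale> approach, the following uses the std::ctype<wchar_t> facet of a std::locale object instead of the global C locale. The helper names locale_or_classic and to_lower_with_locale are hypothetical. Locale names such as "de_DE.UTF-8" are platform-specific assumptions, so the sketch falls back to the classic "C" locale when the requested name is not available.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Construct a named locale, falling back to the classic "C" locale
// if the name is not available on this system.
std::locale locale_or_classic(const char* name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Lowercase a wide string using the ctype facet of the given locale.
std::wstring to_lower_with_locale(std::wstring text, const std::locale& loc)
{
    const auto& facet = std::use_facet<std::ctype<wchar_t>>(loc);
    if (!text.empty()) {
        // Converts the range [begin, end) in place.
        facet.tolower(&text[0], &text[0] + text.size());
    }
    return text;
}
```

    A call might look like to_lower_with_locale(L"GR\u00DCSSEN", locale_or_classic("de_DE.UTF-8")). Whether Ü is converted depends on the locale that was actually obtained.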