Case insensitive operations


I'm working on a project in which case-sensitive operations need to be replaced with case-insensitive ones. After some reading on this, the types of data to consider are:

  1. Ascii characters
  2. Non-ascii characters
  3. Unicode characters

Please let me know if I've missed anything in the list.

Do the above need to be handled separately, or are there C++ libraries that can handle them all regardless of the type of data?

Specifically:

  1. Does the Boost library provide support for this? If so, are there examples or documentation on how to use the APIs?

  2. I learned about IBM's International Components for Unicode (ICU). Does this library provide support for case-insensitive operations? If so, are there examples or documentation on how to use the APIs?

Finally, which of the aforementioned (or other) approaches is better, and why?

Thanks!

Based on the comments and answers, I wrote a sample program to understand this better:

#include <iostream>       // std::cout
#include <string>         // std::string
#include <locale>         // std::locale, std::tolower

using namespace std;

void ascii_to_lower(string& str)
{
     std::locale loc;
     std::cout << "Ascii string: " << str;
     std::cout << "Lower case: ";

     for (std::string::size_type i=0; i<str.length(); ++i)
         std::cout << std::tolower(str[i],loc);
     return;
}

void non_ascii_to_lower(void)
{
    std::locale::global(std::locale("en_US.UTF-8"));
    std::wcout.imbue(std::locale());
    const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t> >(std::locale());
    std::wstring str = L"Zoë Saldaña played in La maldición del padre Cardona.";

    std::wcout << endl << "Non-Ascii string: " << str << endl;

    f.tolower(&str[0], &str[0] + str.size());

    std::wcout << "Lower case: " << str << endl;

    return;
}

void non_ascii_to_upper(void)
{
    std::locale::global(std::locale("en_US.UTF-8"));
    std::wcout.imbue(std::locale());
    const std::ctype<wchar_t>& f = std::use_facet<std::ctype<wchar_t> >(std::locale());
    std::wstring str = L"¥£ªÄë";

    std::wcout << endl << "Non-Ascii string: " << str << endl;

    f.toupper(&str[0], &str[0] + str.size());

    std::wcout << "Upper case: " << str << endl;

    return;
}

int main ()
{
    string str="Test String.\n";

    ascii_to_lower(str);
    non_ascii_to_upper();
    non_ascii_to_lower();

    return 0;
}

The output is:

Ascii string: Test String. Lower case: test string.

Non-Ascii string: ▒▒▒▒▒ Upper case: ▒▒▒▒▒

Non-Ascii string: Zo▒ Salda▒a played in La maldici▒n del padre Cardona. Lower case: zo▒ salda▒a played in la maldici▒n del padre cardona.

The non-ASCII strings seem to get converted to upper and lower case, but some of the text is not visible in the output. Why is this?

On the whole, does the sample code look OK?


Solution

  • I'm a little surprised by this question. A simple search for "boost case conversion" returns, as its first result, "Usage - 1.41.0 - Boost", which has an entry on case conversion.

    A search for "stl case conversion" returns "tolower - C++ Reference - Cplusplus.com", which also shows how to convert using the STL.

    To do a case-insensitive search, convert both strings to the same case (lower or upper) and compare.

    Example code from boost.org:
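    The convert-then-compare idea can be sketched with the standard library alone. This is a minimal sketch, not code from either linked page: the helper name `equals_ignore_case` is hypothetical, and it assumes single-byte text such as ASCII, so it does not handle UTF-8 multi-byte sequences or context-dependent mappings.

```cpp
#include <locale>
#include <string>

// Hypothetical helper (not from the original post): compares two strings
// case-insensitively by lowercasing each char with std::tolower.
// Assumption: single-byte text such as ASCII.
bool equals_ignore_case(const std::string& a, const std::string& b,
                        const std::locale& loc = std::locale::classic())
{
    // Strings of different length can never match character by character.
    if (a.size() != b.size())
        return false;

    for (std::string::size_type i = 0; i < a.size(); ++i) {
        if (std::tolower(a[i], loc) != std::tolower(b[i], loc))
            return false;
    }
    return true;
}
```

    With this helper, `equals_ignore_case("HeLlO WoRld!", "hello world!")` is true, and no temporary lowercase copies of the strings are created.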

    string str1("HeLlO WoRld!");
    to_upper(str1); // str1=="HELLO WORLD!"
    

    Example from Cplusplus.com:

    // tolower example (C++)
    #include <iostream>       // std::cout
    #include <string>         // std::string
    #include <locale>         // std::locale, std::tolower
    
    int main ()
    {
      std::locale loc;
      std::string str="Test String.\n";
      for (std::string::size_type i=0; i<str.length(); ++i)
        std::cout << std::tolower(str[i],loc);
      return 0;
    }
    

    For ASCII characters (characters with a code value < 128), there should be no problem. If you are using MBCS (a multi-byte character set), you may need to use locales for code pages. Unicode should have no problems AFAIK.
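    One caveat worth checking when the text is UTF-8 stored in a `std::string`: converting byte by byte only changes the ASCII letters. The sketch below (the helper name `to_lower_bytes` is hypothetical) uses the classic "C" locale, under which bytes of 128 and above are left untouched, so "Ë" (bytes 0xC3 0x8B) stays uppercase.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: lowercases a string one byte at a time using the
// classic "C" locale. Only the ASCII letters A-Z are mapped; the bytes of
// multi-byte UTF-8 sequences pass through unchanged.
std::string to_lower_bytes(std::string s)
{
    const std::locale& loc = std::locale::classic();
    for (std::string::size_type i = 0; i < s.size(); ++i)
        s[i] = std::tolower(s[i], loc);
    return s;
}
```

    For example, `to_lower_bytes("ZO\xC3\x8B")` yields `"zo\xC3\x8B"`: the "Z" and "O" are lowered, while the two bytes encoding "Ë" are not. Handling such characters requires wide characters with a suitable locale, as in the question's sample, or a Unicode-aware library.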

    As to Matt Jordan's comment:

    The real issue with this request is that many languages have contextual requirements for case conversion - e.g. capital sigma 0x03A3 in Greek should become either 0x03C3 or 0x03C2, depending on whether it is at the end of a word or not.

    I would be pleasantly surprised if the Boost library supported this. You would have to test it and report a bug if it doesn't. There's no reference on their page saying whether they do any contextual case conversions. A workaround might be to test both ways: convert to lowercase and compare, and convert to uppercase and compare. If either comparison succeeds, there's a match, which should work for 99.99% of cases.
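    The workaround above can be sketched as follows. The mapping functions `toy_lower` and `toy_upper` are hypothetical stand-ins that cover only ASCII letters and the three Greek sigma code points; a real program would use the case mappings of its locale or Unicode library instead.

```cpp
#include <cstddef>
#include <string>

// Toy context-free case mappings (hypothetical, for illustration only).
char32_t toy_upper(char32_t c)
{
    if (c >= U'a' && c <= U'z') return c - (U'a' - U'A');
    // Small sigma (0x03C3) and final sigma (0x03C2) both map to capital sigma.
    if (c == 0x03C3 || c == 0x03C2) return 0x03A3;
    return c;
}

char32_t toy_lower(char32_t c)
{
    if (c >= U'A' && c <= U'Z') return c + (U'a' - U'A');
    // Capital sigma always maps to small sigma here, never to final sigma.
    if (c == 0x03A3) return 0x03C3;
    return c;
}

typedef char32_t (*CaseMap)(char32_t);

// Compares two strings after applying the same mapping to every character.
bool equal_after(CaseMap map, const std::u32string& a, const std::u32string& b)
{
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (map(a[i]) != map(b[i])) return false;
    return true;
}

// The workaround: a match under either mapping counts as a match.
bool equals_either_case(const std::u32string& a, const std::u32string& b)
{
    return equal_after(toy_lower, a, b) || equal_after(toy_upper, a, b);
}
```

    Comparing `U"ABC\u03A3"` with `U"abc\u03C2"` shows why both directions are tried: the lowercase comparison fails because capital sigma lowers to 0x03C3 while the final sigma 0x03C2 stays as it is, but the uppercase comparison succeeds because both sigmas map to 0x03A3.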

    An interesting paper by Bjarne Stroustrup, found here, is a good read regarding Locales.