visual-studio-2015 mfc arabic toupper

Converting characters to uppercase in Arabic


I have this code to convert letters to uppercase:

// make this character upper
if(_istalpha(zChar) && !_istupper(zChar))
   pMsg->wParam = (WPARAM)_toupper(zChar);

It has worked for years. Recently I was asked to support Arabic and my user said letters were getting corrupted. It is because of the above code.

I am told that uppercase does not apply in Arabic. I know I can check my program settings to see whether the user has selected Arabic and skip this code. But is there another way?

I know with dates you call _tsetlocale first for example.

Update:

Located this topic about toupper which mentions the locale setting! Will try it.


Solution

  • As you've discovered, the classic conversion routines like the CRT's toupper and Win32's CharUpper are rather dumb. They generally hail from the time when all the world was assumed to be ASCII.

    What you need is a linguistically-sensitive conversion. This is not only computationally more expensive, but also very difficult to implement correctly. Languages are hard. So you want to offload the responsibility to a standard library if at all possible. Since you're using MFC, you're obviously targeting the Windows operating system, which means you're in luck. You can piggyback on the hard work of Microsoft's localization engineers, with the additional benefit of consistency with the shell and other OS components.

    The function you need to call is LCMapStringEx (or LCMapString if you are still targeting pre-Vista platforms). The complexity of this function's signature serves as strong testament to the complicated task of proper linguistically-aware string handling.

    • First, you need to choose a locale. You usually want the user's default locale, which you can specify with LOCALE_NAME_USER_DEFAULT, but you can use anything you want here.
    • For the flags, you will want LCMAP_UPPERCASE | LCMAP_LINGUISTIC_CASING. To do the reverse operation, you'd use LCMAP_LOWERCASE | LCMAP_LINGUISTIC_CASING. There are lots of other interesting and useful options here to keep in mind, too.
    • Then you have a pointer to the source string, and its length in characters (code units).
    • And a pointer to a string buffer that receives the results, as well as its maximum length in characters (code units).
    • The final three parameters (version information, a reserved pointer, and a sort handle) can simply be set to NULL or 0.

    Putting it all together:

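    BOOL ConvertToUppercase(std::wstring& buffer)
    {
        if (buffer.empty())
            return TRUE;  // nothing to convert; a source length of 0 is rejected

        // LCMapStringEx takes int lengths and returns the number of
        // characters written, or 0 on failure.
        const int length = static_cast<int>(buffer.length());
        return LCMapStringEx(LOCALE_NAME_USER_DEFAULT  /* or whatever locale you want */,
                             LCMAP_UPPERCASE | LCMAP_LINGUISTIC_CASING,
                             buffer.c_str(),
                             length,
                             &buffer[0],
                             length,
                             NULL,
                             NULL,
                             0) != 0;
    }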
    

    Note that I'm doing an in-place conversion of the buffer's contents here, and therefore assuming that the uppercased string is exactly the same length as the original input string. This is probably true, but may not be a universally safe assumption. If the result does not fit, the call fails and GetLastError reports ERROR_INSUFFICIENT_BUFFER, so you will want to handle that error, defensively add some extra padding to the buffer, or both.
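
    One way to avoid the same-length assumption is to ask the function for the required size first: when the destination length is 0, LCMapStringEx returns the number of characters needed instead of writing anything. The sketch below does that and returns a new string rather than converting in place. The helper name ToUpperLinguistic is made up for illustration, and the non-Windows branch (plain std::towupper) is an assumption added only so the snippet builds off Windows; it is not linguistically aware.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#else
#include <cwctype>
#endif

// Hypothetical helper: returns an uppercased copy of `input`,
// or an empty string if the conversion fails.
std::wstring ToUpperLinguistic(const std::wstring& input)
{
    if (input.empty())
        return std::wstring();  // LCMapStringEx rejects a source length of 0
#ifdef _WIN32
    const DWORD flags = LCMAP_UPPERCASE | LCMAP_LINGUISTIC_CASING;
    const int srcLen = static_cast<int>(input.length());

    // First call: a destination length of 0 asks for the required size.
    const int needed = LCMapStringEx(LOCALE_NAME_USER_DEFAULT, flags,
                                     input.c_str(), srcLen,
                                     NULL, 0, NULL, NULL, 0);
    if (needed <= 0)
        return std::wstring();

    // Second call: convert into a buffer of exactly that size.
    std::wstring output(static_cast<std::size_t>(needed), L'\0');
    const int written = LCMapStringEx(LOCALE_NAME_USER_DEFAULT, flags,
                                      input.c_str(), srcLen,
                                      &output[0], needed, NULL, NULL, 0);
    if (written <= 0)
        return std::wstring();
    output.resize(static_cast<std::size_t>(written));
    return output;
#else
    // Fallback so the sketch builds elsewhere: per-character, not linguistic.
    std::wstring output(input);
    for (wchar_t& ch : output)
        ch = static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(ch)));
    return output;
#endif
}
```

    The cost is a second call and an allocation per conversion, which is usually negligible next to the safety of never writing past the buffer.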

    If you'd prefer to use CRT functions like you're doing now, _totupper_l and its friends are wrappers around LCMapString/LCMapStringEx. Note the _l suffix, which indicates that these are the locale-aware conversion functions. They allow you to pass an explicit locale, which will be used in the conversion.
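
    For the single-character case in the question, the corruption most likely comes from two things together: the guard "alphabetic and not uppercase" is also true for caseless letters such as Arabic ones, and the _toupper macro is, as far as I know, only defined for ASCII lowercase input. A minimal sketch of a safer guard follows. It uses the standard <cwctype> functions rather than the _l variants so that it stays portable, and the function name is illustrative.

```cpp
#include <cwctype>

// Illustrative helper: uppercase only characters that actually have a
// lowercase form; caseless letters (Arabic, for instance) pass through.
wchar_t UpperIfLowercase(wchar_t ch)
{
    const std::wint_t c = static_cast<std::wint_t>(ch);
    return std::iswlower(c) ? static_cast<wchar_t>(std::towupper(c)) : ch;
}
```

    In the handler from the question, the result of this helper would be assigned to pMsg->wParam in place of the _istalpha/_istupper test and the _toupper call.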