c++ utf-8 locale icu

Iterate over character blocks in UTF-8


My task is to iterate over all the UTF-8 character codes that correspond to a given language (locale). I suppose it's not that easy and that I have to iterate over character blocks instead (the whole Cyrillic block for "ru_RU", for example). I can find the character blocks on the wiki page https://en.wikipedia.org/wiki/UTF-8, but I hope there is a better way than reinventing the wheel.

I've had a look at icu-project, but I can't figure out whether it can do what I need.

What I want to have as a result is something like this:

for (unsigned int i = UBLOCK_GREEK_EXTENDED; i < UBLOCK_GREEK_EXTENDED_SIZE; i++) {
    // do stuff
}

icu-project is a very powerful tool, so I hope someone knows how to do this :)

UPDATE: I'm working on localization options for a 3D framework for mobile devices. It rasterizes and encodes TrueType fonts so that text can be rendered easily by picking the required images from the rasterized font files. Since I have to be careful about memory usage, I want to split the rasterized font into different files for different locales (or languages, or character blocks such as Cyrillic or Greek), so that I don't have to keep the whole UTF-8 font in memory all the time and can load only the corresponding file after detecting the locale.

Thanks!


Solution

  • So, I've finally found a way to do it properly using the icu-project library http://site.icu-project.org.

    Here is an example solution. You specify a locale or language and get the set of Unicode code point ranges that cover the scripts associated with that locale/language. You can then get the start and end of each range.

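    #include <cstdio>
    #include <unicode/uniset.h>
    #include <unicode/unistr.h>
    #include <unicode/uscript.h>

    using namespace icu;

    int main() {
        UErrorCode err = U_ZERO_ERROR;
        const int32_t capacity = 10;
        UScriptCode script[capacity] = {USCRIPT_INVALID_CODE};

        // Look up the script codes used by the given language or locale.
        int32_t num = uscript_getCode("en", script, capacity, &err);
        if (U_FAILURE(err)) {
            printf("uscript_getCode failed: %s\n", u_errorName(err));
            return 1;
        }

        // Build a set pattern such as "[[:Latn:][:Cyrl:]]"; adjacent sets form a union.
        UnicodeString pattern = UnicodeString::fromUTF8("[");
        for (int32_t j = 0; j < num; j++) {
            pattern.append(UnicodeString::fromUTF8("[:"));
            pattern.append(UnicodeString::fromUTF8(uscript_getShortName(script[j])));
            pattern.append(UnicodeString::fromUTF8(":]"));
        }
        pattern.append(UnicodeString::fromUTF8("]"));

        UnicodeSet cnvSet(pattern, err);
        if (U_FAILURE(err)) {
            printf("Invalid set pattern: %s\n", u_errorName(err));
            return 1;
        }
        printf("Number of script code associated are : %d \n", num);
        printf("Range count: %d\n", cnvSet.getRangeCount());
        printf("Set size: %d\n", cnvSet.size());
        // Each range is an inclusive [start, end] pair of code points.
        for (int32_t i = 0; i < cnvSet.getRangeCount(); i++) {
            printf("Range start: %x\n", cnvSet.getRangeStart(i));
            printf("Range end: %x\n", cnvSet.getRangeEnd(i));
        }
        return 0;
    }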
    

    Results for language "en" from this example:

    Number of script code associated are : 1
    Range count: 30
    Set size: 1272
    Range start: 41
    Range end: 5a
    Range start: 61
    Range end: 7a
    ...
    Range start: ff41
    Range end: ff5a

    These are all the character ranges that belong to the Latin script.
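Once you have the range starts and ends, walking every character and producing its UTF-8 bytes is straightforward. Below is a minimal sketch that does this without ICU, assuming the ranges are inclusive code point pairs like the ones printed above. The names `CodePointRange`, `encodeUtf8` and `collectUtf8` are made up for this illustration and are not part of ICU.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type: an inclusive range of Unicode code points,
// e.g. one (range start, range end) pair from the output above.
struct CodePointRange {
    uint32_t start;
    uint32_t end;
};

// Encode a single code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encodeUtf8(uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogates are not valid scalar values
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Collect the UTF-8 form of every encodable code point in the given ranges.
std::vector<std::string> collectUtf8(const std::vector<CodePointRange>& ranges) {
    std::vector<std::string> result;
    for (const CodePointRange& r : ranges) {
        for (uint32_t cp = r.start; cp <= r.end; ++cp) {
            std::string bytes = encodeUtf8(cp);
            if (!bytes.empty()) {
                result.push_back(bytes);
            }
        }
    }
    return result;
}
```

For example, `collectUtf8({{0x41, 0x5A}, {0x61, 0x7A}, {0xFF41, 0xFF5A}})` covers three of the ranges listed above and yields 78 strings, the first of which is "A". In the font use case, each string (or the code point itself) could serve as the key for one rasterized glyph in the per-language file.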