Iterate over character blocks in UTF-8

My task is to iterate over all the UTF-8 character codes that correspond to a given language (locale). I suppose it's not that easy, and that I'll have to iterate over whole character blocks instead (all of Cyrillic for "ru_RU", for example). I can find the character blocks on the wiki page https://en.wikipedia.org/wiki/UTF-8, but I hope there is a better way than reinventing the wheel.

I've had a look at the ICU project, but I can't figure out whether it can do what I need.

What I'd like to end up with is something like this (pseudocode):

for (unsigned int i = UBLOCK_GREEK_EXTENDED; i < UBLOCK_GREEK_EXTENDED_SIZE; i++) {
    // do stuff
}

ICU is a very powerful tool, so I hope someone knows how to do this :)

UPDATE: I'm working on localization options for a 3D framework for mobile devices. It rasterizes and encodes TrueType fonts so that text can be rendered simply by picking the required glyph images from the rasterized font files. Since I have to watch memory usage, I want to split the rasterized font into separate files for different locales (or languages, or character blocks such as Cyrillic or Greek). That way I don't have to keep the whole UTF-8 font in memory all the time and can load only the matching file after detecting the locale.

Thanks!

Solution

So, I've finally found a way to do it properly using the ICU library: http://site.icu-project.org.

Here is an example solution. You specify a locale or language and get back the script codes used to write it. From those script codes you build a set of characters, and you can then read the start and end code point of each contiguous range in that set.

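#include <cstdio>
#include <unicode/uniset.h>
#include <unicode/unistr.h>
#include <unicode/uscript.h>

using icu::UnicodeSet;
using icu::UnicodeString;

int main() {
    UErrorCode err = U_ZERO_ERROR;
    const int32_t capacity = 10;
    const int32_t strLength = 4;  // ISO 15924 short names are 4 letters, e.g. "Latn"
    UScriptCode script[capacity] = {USCRIPT_INVALID_CODE};

    // Look up the script codes used to write the given language or locale.
    int32_t num = uscript_getCode("en", script, capacity, &err);
    if (U_FAILURE(err)) {
        printf("uscript_getCode failed: %s\n", u_errorName(err));
        return 1;
    }

    // Build a pattern such as "[[:Latn:]]". Sets placed side by side inside
    // the outer brackets are unioned, so no separator is needed between them.
    UnicodeString pattern("[", 1, US_INV);
    for (int32_t j = 0; j < num; j++) {
        const char* shortname = uscript_getShortName(script[j]);
        UnicodeString str(shortname, strLength, US_INV);
        pattern.append("[:");
        pattern.append(str);
        pattern.append(":]");
    }
    pattern.append("]");

    UnicodeSet cnvSet(pattern, err);
    if (U_FAILURE(err)) {
        printf("Invalid pattern: %s\n", u_errorName(err));
        return 1;
    }
    printf("Number of script code associated are : %d \n", num);
    printf("Range count: %d\n", cnvSet.getRangeCount());
    printf("Set size: %d\n", cnvSet.size());
    for (int32_t i = 0; i < cnvSet.getRangeCount(); i++) {
        printf("Range start: %x\n", cnvSet.getRangeStart(i));
        printf("Range end: %x\n", cnvSet.getRangeEnd(i));
    }
    return 0;
}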

Results for language "en" from this example:

Number of script code associated are : 1
Range count: 30
Set size: 1272
Range start: 41
Range end: 5a
Range start: 61
Range end: 7a
...
Range start: ff41
Range end: ff5a

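These are all the character ranges that belong to the Latin script. Note that a script is not the same thing as a single Unicode block: the Latin script is spread over several blocks, which is why the ranges run from 41 up to ff5a.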
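
Once you have the range starts and ends, walking every character is just a nested loop. Below is a minimal sketch of that step in plain C++ with no ICU dependency, assuming you have copied the printed "Range start" / "Range end" values into a list. The names CodePointRange, expandRanges and encodeUtf8 are made up for this illustration and are not part of ICU. The encoder is only needed if the rasterizer wants UTF-8 bytes rather than code point numbers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Inclusive code point range, e.g. one "Range start" / "Range end" pair.
// (Hypothetical helper type for this sketch.)
struct CodePointRange {
    std::uint32_t start;
    std::uint32_t end;
};

// Encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Collect every code point covered by the given ranges, in order.
std::vector<std::uint32_t> expandRanges(const std::vector<CodePointRange>& ranges) {
    std::vector<std::uint32_t> result;
    for (const CodePointRange& r : ranges) {
        assert(r.start <= r.end);  // ranges are expected to be well-formed
        for (std::uint32_t cp = r.start; cp <= r.end; ++cp) {
            result.push_back(cp);
        }
    }
    return result;
}
```

For example, calling expandRanges with the first two ranges from the output above, {0x41, 0x5A} and {0x61, 0x7A}, yields the 52 basic Latin letters, and each of them can be passed to encodeUtf8 before it goes to the rasterizer.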