My task is to iterate over all the UTF-8 character codes corresponding to a given language (locale). I suppose it's not that easy and I have to iterate over character blocks (like the whole Cyrillic block for "ru_RU", for example). I can find the character blocks on the wiki page https://en.wikipedia.org/wiki/UTF-8, but I hope there is a better way than reinventing the wheel.
I've had a look at icu-project, but I can't figure out if I can do what I need.
What I want to have as a result is something like this:
for (unsigned int i = UBLOCK_GREEK_EXTENDED; i < UBLOCK_GREEK_EXTENDED_SIZE; i++) {
    // do stuff
}
icu-project is a very powerful tool, so I hope someone knows how to do this :)
UPDATE: I'm working on localization options for a 3D framework for mobile devices. It rasterizes and encodes TrueType fonts so they can be rendered easily by picking the required images from the rasterized font files. Since I have to be careful with memory, I want to split the rasterized font into separate files for different locales (or languages, or character blocks like Cyrillic or Greek), so I don't have to keep the whole UTF-8 font in memory all the time and can load only the corresponding file after detecting the locale.
Thanks!
So, I've finally found a way to do it properly using the icu-project library http://site.icu-project.org.
Here is an example solution. You specify a locale or language and get the scripts associated with it. A UnicodeSet built from those scripts then gives you the character ranges that contain the symbols relevant to the locale/language, and you can read the start and end of each range.
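#include <stdio.h>
#include <unicode/uniset.h>
#include <unicode/unistr.h>
#include <unicode/uscript.h>

using namespace icu;

int main() {
    UErrorCode err = U_ZERO_ERROR;
    const int32_t capacity = 10;
    UScriptCode script[capacity] = {USCRIPT_INVALID_CODE};

    // Look up the script(s) associated with a locale or language name.
    int32_t num = uscript_getCode("en", script, capacity, &err);
    if (U_FAILURE(err)) {
        printf("uscript_getCode failed: %s\n", u_errorName(err));
        return 1;
    }

    // Build a set pattern such as "[[:Latn:]]" or "[[:Latn:][:Grek:]]".
    // Adjacent property sets inside the outer brackets form a union.
    UnicodeString pattern = UnicodeString::fromUTF8("[");
    for (int32_t j = 0; j < num; j++) {
        // Short names are four-letter script codes, e.g. "Latn".
        const char* shortname = uscript_getShortName(script[j]);
        pattern.append(UnicodeString::fromUTF8("[:"));
        pattern.append(UnicodeString::fromUTF8(shortname));
        pattern.append(UnicodeString::fromUTF8(":]"));
    }
    pattern.append(UnicodeString::fromUTF8("]"));

    UnicodeSet cnvSet(pattern, err);
    if (U_FAILURE(err)) {
        printf("UnicodeSet pattern failed: %s\n", u_errorName(err));
        return 1;
    }

    printf("Number of script code associated are : %d \n", num);
    printf("Range count: %d\n", cnvSet.getRangeCount());
    printf("Set size: %d\n", cnvSet.size());
    for (int32_t i = 0; i < cnvSet.getRangeCount(); i++) {
        printf("Range start: %x\n", (unsigned)cnvSet.getRangeStart(i));
        printf("Range end: %x\n", (unsigned)cnvSet.getRangeEnd(i));
    }
    return 0;
}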
Results for language "en" from this example:
Number of script code associated are : 1
Range count: 30
Set size: 1272
Range start: 41
Range end: 5a
Range start: 61
Range end: 7a
...
Range start: ff41
Range end: ff5a
These are all the character ranges that correspond to the Latin script.
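Once you have the start and end of each range, iterating over every character is just a nested loop. Below is a minimal sketch of that last step, using only the standard library so it does not depend on ICU. The names CodePointRange, encodeUtf8, countCodePoints and collectUtf8 are my own hypothetical helpers, not part of ICU, and the sketch assumes the ranges are inclusive, as getRangeStart()/getRangeEnd() report them.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Inclusive code point range (start, end). Hypothetical type, not an ICU one.
using CodePointRange = std::pair<uint32_t, uint32_t>;

// Encode one code point as UTF-8.
// Assumes cp <= 0x10FFFF and that cp is not a surrogate (no validation here).
std::string encodeUtf8(uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Count how many code points a list of inclusive ranges covers.
std::size_t countCodePoints(const std::vector<CodePointRange>& ranges) {
    std::size_t total = 0;
    for (const auto& r : ranges) {
        total += r.second - r.first + 1;
    }
    return total;
}

// Visit every code point in every range and collect its UTF-8 encoding.
std::string collectUtf8(const std::vector<CodePointRange>& ranges) {
    std::string out;
    for (const auto& r : ranges) {
        for (uint32_t cp = r.first; cp <= r.second; ++cp) {
            // "do stuff" goes here, e.g. rasterize the glyph for cp.
            out += encodeUtf8(cp);
        }
    }
    return out;
}
```

For example, filling a std::vector<CodePointRange> with the first two ranges from the output above, {0x41, 0x5a} and {0x61, 0x7a}, visits the 52 basic Latin letters. In the rasterizer you would replace the body of the inner loop with the glyph rendering call instead of collecting strings.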