Tags: c++, unicode, bijection

Index-based access on Matrix-like structure in C++


I have an Nx2 mapping between two sets of encodings (not relevant here, but they are Unicode and GB18030) in the format below. Warning: the XML file is huge, so don't open it on a slow connection: http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml

Snapshot:

<a u="00B7" b="A1 A4"/>
<a u="00B8" b="81 30 86 30"/>
<a u="00B9" b="81 30 86 31"/>
<a u="00BA" b="81 30 86 32"/>

I would like to store the b values (right column) in a data structure and access them directly, without searching, using indexes derived from the u values (left column).

Example:

I can store those elements in a data structure like this:

unsigned short *my_page[256] = {my_00,my_01, ....., my_ff}

where each element is defined like:

static unsigned short my_00[256] etc.

So it is basically a table of tables: 256x256 = 65536 available slots.

For other encodings with fewer elements and different values (e.g. Chinese Big5, Japanese Shift JIS, Korean KSC), I can access the elements using a bijective function like this:

element = my_page[(unicode[i]>>8)&0x00FF][unicode[i]&0x00FF];

where unicode[i] holds the u values from the mapping shown above. The my_page structure is generated and filled in an analogous way. For the encodings that work, I have around 7000 characters to store, and each one ends up in a unique slot of my_page.
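For reference, the two-level lookup described here can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the `PageTable` class and its `set`/`get` members are hypothetical names, 0 is assumed to mean "no mapping", and the slot type is widened to `std::uint32_t` on the assumption that a four-byte value such as 81 30 86 30 should fit in one slot (it would not fit in an `unsigned short`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical two-level table: 256 pages x 256 slots, 0 marks "no mapping".
// uint32_t is an assumption so that a packed four-byte value fits in one slot.
class PageTable {
public:
    PageTable() : slots_(256 * 256, 0) {}

    // Store a value for a 16-bit code point.
    void set(std::uint16_t cp, std::uint32_t value) {
        slots_[index(cp)] = value;
    }

    // Direct lookup by index, no searching.
    std::uint32_t get(std::uint16_t cp) const {
        return slots_[index(cp)];
    }

private:
    // Same split as in the question: high byte selects the page,
    // low byte selects the slot inside that page.
    static std::size_t index(std::uint16_t cp) {
        std::size_t page = (cp >> 8) & 0x00FF;
        std::size_t slot = cp & 0x00FF;
        return page * 256 + slot;
    }

    std::vector<std::uint32_t> slots_;
};
```

Filling the table with the first two snapshot rows would then look like `table.set(0x00B7, 0xA1A4)` and `table.set(0x00B8, 0x81308630)`, and `table.get(0x00B8)` reads the value back without any search.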

The problem comes with the GB18030 encoding, where I try to store 30861 elements in my_page (65536 slots). I use the same bijective function for filling (and, analogously, for accessing) the my_page structure, but it fails because the access does not return unique results.

For example, more than one Unicode value ends up in the same slot my_page[(unicode[i]>>8)&0x00FF][unicode[i]&0x00FF], since the computed indexes can be identical for i and for i+1. Do you know another way of filling and accessing the elements of my_page based only on pre-computed indexes, as I was trying to do?
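One way to narrow down where such collisions come from is to count them directly. The sketch below uses a hypothetical helper, `count_slot_collisions`, that applies the same high-byte/low-byte split to a list of code points. Note that the split cannot collide for distinct values in the range 0x0000 to 0xFFFF; collisions appear only when an input is repeated or when the masks discard bits above bit 15. Whether that is what happens with the asker's data is an assumption that would need checking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts inputs that land on a slot already taken by an earlier input.
// Hypothetical helper, intended for diagnosis only.
std::size_t count_slot_collisions(const std::vector<std::uint32_t>& code_points) {
    std::vector<bool> used(256 * 256, false);
    std::size_t collisions = 0;
    for (std::uint32_t cp : code_points) {
        // Same index computation as my_page[(cp>>8)&0xFF][cp&0xFF].
        std::size_t slot = ((cp >> 8) & 0x00FF) * 256 + (cp & 0x00FF);
        if (used[slot]) {
            ++collisions;
        } else {
            used[slot] = true;
        }
    }
    return collisions;
}
```

Running it over all 65536 values from 0x0000 to 0xFFFF reports no collisions, while a pair such as 0x00B7 and 0x100B7 reports one, because the masks drop the bits that distinguish them.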

I assume I have to use something like a pseudo-hash function that returns a range of values VRange, from which I can then extract the integer indexes into my_page[256][256] using a set of rules.

If you have any advice, please let me know :)

Thank you !


Solution

  • For GB18030, refer to this document: http://icu-project.org/docs/papers/gb18030.html

    As explained in this article: “The number of valid byte sequences -- of Unicode code points covered and of mappings defined between them -- makes it impractical to directly use a normal, purely mapping-table-based codepage converter. With about 1.1 million mappings, a simple mapping table would be several megabytes in size.” So a conversion based purely on a mapping table is most probably not a good idea. For large parts, there is a direct mapping between GB18030 and Unicode: most of the four-byte characters can be translated algorithmically. The author of the article suggests handling such ranges with special code, and the remaining characters with a classic mapping table. Those remaining characters are the ones given in the XML mapping table: http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml

    Therefore, purely index-based access on a matrix-like structure in C++ remains an open problem for anyone who wants to research such bijective functions.
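The algorithmic part mentioned in the answer can be sketched as below. This is a minimal illustration, not the ICU implementation: the names `FourBytes`, `to_linear`, `from_linear`, `LinearRange`, `snapshot_range` and `encode_in_range` are all hypothetical. It assumes the usual layout of four-byte GB18030 sequences (bytes 1 and 3 in 0x81..0xFE, bytes 2 and 4 in 0x30..0x39), so every sequence corresponds to one linear index. The only range used is the one visible in the question's snapshot, where U+00B8..U+00BA map to the consecutive sequences 81 30 86 30 .. 81 30 86 32.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using FourBytes = std::array<std::uint8_t, 4>;

// Linear index of a four-byte sequence. Bytes 1 and 3 take 126 values
// (0x81..0xFE), bytes 2 and 4 take 10 values (0x30..0x39).
// Assumes the input is a well-formed four-byte sequence.
std::uint32_t to_linear(const FourBytes& b) {
    return (((b[0] - 0x81u) * 10 + (b[1] - 0x30u)) * 126
            + (b[2] - 0x81u)) * 10 + (b[3] - 0x30u);
}

// Inverse of to_linear: rebuild the four bytes from a linear index.
FourBytes from_linear(std::uint32_t n) {
    FourBytes b{};
    b[3] = static_cast<std::uint8_t>(0x30 + n % 10);  n /= 10;
    b[2] = static_cast<std::uint8_t>(0x81 + n % 126); n /= 126;
    b[1] = static_cast<std::uint8_t>(0x30 + n % 10);  n /= 10;
    b[0] = static_cast<std::uint8_t>(0x81 + n);
    return b;
}

// A contiguous range: code points first..last map to consecutive
// linear indexes starting at base.
struct LinearRange {
    std::uint32_t first;
    std::uint32_t last;
    std::uint32_t base;
};

// Illustrative range built only from the three four-byte snapshot rows.
LinearRange snapshot_range() {
    return LinearRange{0x00B8, 0x00BA, to_linear(FourBytes{0x81, 0x30, 0x86, 0x30})};
}

// Returns true and fills out if cp lies in the range; otherwise returns
// false, meaning the caller should fall back to the mapping table.
bool encode_in_range(const LinearRange& r, std::uint32_t cp, FourBytes& out) {
    if (cp < r.first || cp > r.last) {
        return false;
    }
    out = from_linear(r.base + (cp - r.first));
    return true;
}
```

With this split, a code point such as U+00B9 is converted by arithmetic alone, while U+00B7 (which maps to the two-byte A1 A4) falls outside the range and would be looked up in a conventional table. A real converter would need the full list of ranges, which this sketch does not attempt to reproduce.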