Search code examples
unicodecharacter-encodingshift-jis

Space-saving character encoding for japanese?


In my opinion a common problem: character encoding in combination with a bitmap-font. Most multi-language encodings have an huge space between different character types and even a lot of unused code points there. So if I want to use them I waste a lot of memory (not only for saving multi-byte text - i mean specially for spaces in my bitmap-font) - and VRAM is mostly really valuable... So the only reasonable thing seems to be: Using an custom mapping on my texture for i.e. UTF-8 characters (so that no space is waste). BUT: This effort seems to be same with use an own proprietary character encoding (so also own order of characters in my texture). In my specially case I got texture space for 4096 different characters and need characters to display latin languages as well as japanese (its a mess with utf-8 that only support generall cjk codepages). Had somebody ever a similiar problem (I really wonder, if not)? If theres already any approach?

Edit: The same Problem is described here http://www.tonypottier.info/Unicode_And_Japanese_Kanji/ but it doesnt provide an real solution how to save these bitmapfont mappings to utf-8 space efficent. So any further help is welcome!

Edit2:

Thank you very much for your answer. Im sorry, that my problem wasn't clear enough described.

What I really want to solve, is: The CJK Unicode range is over 20000 characters. But only a subset of around 2000 characters are necessary to display japanese text properly. These characteres are spreaded in range from U+4E00 to U+9FA5. So I need to transform these Unicode Codepoints (only the 2000 for japanese) somehow to the coordinates of my created texture (where I can order the characters also like I want).

i.e. U+4E03 is a japanese character, but U+4E04, U+4E05, U+4E06 is not. Then U+4E07 is a japanese character as well. So the easiest solution, I can see: after character U+4E03 leave three spaces in my texture (or write the not necessary characters U+4E04, U+4E05, U+4E06 there) and then write U+4E07. But this would waste soo much texture space (20000 characters, even if only 2000 are necessary). So I want to be able to put in my texture only: "...U+4E03, U+4E07...". But I have no idea how to write my displayText function then - because I cant know where are the texture coordinates of the glyph I want to display. There would be a hashmap or something like this necessary, but I have no idea how to store these data (it would be a mess to write for every character something like ...{U+4E03, 128}, {U+4E07, 129}... to fill the hasmap).

To the questions: 1) No specific format - so I will write the displayText function by myself. 2) No reason against unicode - its only that CJK range problem for my bitmapfont. 3) I think, thats generally plattform & language independent, but in my case Im using C++ with OpenGL on Mac OS X/iOS.

Thank you very much for your help! If you have any further idea for this, it would really help me a lot!


Solution

  • What is the real problem you want to solve?

    Is it that a UTF-8 encoded string occupies three bytes per character? If yes, switch to UTF-16. Otherwise don't blame UTF-8. (Explanation: UTF-8 is just an algorithm to convert a sequence of integers to a sequence of bytes. It has nothing to do with the grouping of characters in codepages. That in turn is what Unicode code points are for.)

    Is it that the Unicode code points are distributed over many "codepages" (where a "codepage" means a block of 256 adjacent Unicode code points)? If yes, invent a mapping from the Unicode code points (0x000000 - 0x10FFFF) to a smaller set of integers. In terms of memory this should cost no more than 4 bytes times the number of characters you really need. The lookup time would be approximately 24 memory accesses, 24 integer comparisons and 24 branch instructions. (In fact, this would be a binary search in a tree map.) And if that's too expensive you could use a mapping based on a hash table.

    Is it something else? Then please give us some examples, to better understand your problem.

    As far as I understand it you should probably write a small utility program that takes as input a set of Unicode code points that you want to use in your application and then generates The code and data for displaying texts. This raises the questions:

    1. Do you have to use a specific bitmap font format or will you write the displayText function yourself?
    2. Is there any reason against using Unicode for all strings and to convert them to your bitmap-optimized encoding just for the time when you render text? The encoding conversion would of course be internal to the displayText method and not visible to the normal application code.
    3. Just out of interest: Is the problem specific to a certain programming language or environment?

    Update:

    I am assuming that your main problem is some function like this:

    Rectangle position(int codepoint)
    

    If I had to do this, I would start by having one bitmap for each character. The bitmap's file name would be the codepoint, so that the "big picture" can be regenerated easily, just in case you find some more characters you need. The preparation consists of the following steps:

    1. Load all the bitmaps and determine their dimensions. The result of this step is a map from integers to (width, height) pairs.
    2. Compute a good layout for the character images in the big picture and remember where each character was placed. Save the big picture. Save the mapping from codepoints to (x, y, width, height) to another file. This can be a text file, or if you don't have disk space, a binary file. The details don't matter.

    The displayText function would then work as follows:

    void displayText(int x, int y, String s) {
      for (char c : s.toCharArray()) { // TODO: handle code points correctly
        int codepoint = c;
        Rectangle position = positions.get(codepoint);
        if (position != null) {
          // draw bitmap
          x += position.width;
        }
      }
    }
    
    Map<Integer, Rectangle> positions = loadPositionsFromFile();
    

    Now the only problem that is left is how this map can be represented in memory using as little memory as possible, and still be fast enough. That, of course, depends on your programming language.

    The in-memory representation could be a few arrays that contain x, y, width, height. For each element, a 16 bit integer should be enough. And probably you only need 8 bits for width and height anyway. Another array would then map the codepoint to the index into positionData (or some special value if the codepoint is not available). This would be an array of 20000 16 bit integers, so in summary you have:

    • 2000 * (2 + 2 + 1 + 1) = 12000 bytes for positionX, positionY, positionWidth and positionHeight
    • 20000 * 2 = 40000 bytes for codepointToIndexInPositionArrays, if you use an array instead of a map.

    Compared to the size of the bitmap itself, this should be small enough. And since the arrays don't change they can be in read-only memory.