encoding, character-encoding, chinese-locale, shift-jis

Getting a character map of any obscure character set/encoding (e.g. ibm-943_P14A-2000)


Recently our software has had an issue with certain obscure kanji (Chinese characters) that our Shift-JIS encoding cannot represent. I wrote an algorithm that reads through any string meant for Shift-JIS, looks for "out of bounds" kanji, and switches the string to UTF-8 if it finds any (UTF-8 covers far more characters, but uses more space).
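For illustration, the fallback described above might look roughly like the sketch below. It is not the actual algorithm from our software: it assumes Python's built-in `shift_jis` codec as a stand-in, and that codec's coverage is not identical to ibm-943_P14A-2000. The function names are made up for the example.

```python
def encode_with_fallback(text: str, primary: str = "shift_jis") -> tuple[bytes, str]:
    """Encode with the primary codec; fall back to UTF-8 if any character is unmappable.

    Returns the encoded bytes together with the name of the codec that was used.
    """
    try:
        return text.encode(primary), primary
    except UnicodeEncodeError:
        return text.encode("utf-8"), "utf-8"


def unmappable_chars(text: str, codec: str = "shift_jis") -> list[str]:
    """List the characters in text that the given codec cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing
```

Testing each character separately is slow for long strings, but it reports exactly which kanji fall outside the codec, which is what a character map would let you verify.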

In order to find what Kanji won't be covered, I need to get my hands on a character map of the ibm-943_P14A-2000 encoding.

Where does one find maps for these character sets? It's pretty easy via web search to find UTF-8 lookups and the like, but I simply cannot find a chart/table/file showing which byte values correspond to which characters in this encoding.

If you could point me in any direction, no matter how obscure, I'd be very grateful.


Solution

  • The ICU project has a fairly large set of character set mapping tables, including ibm-943_P14A-1999. The difference between '1999' and '2000' is explained in this thread, and you can check out older versions of the ICU source code for the old table. The format of the table is described in the ICU User Guide.

    As for the original character mappings (the character set of IBM-943), they are documented here.
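Once you have one of the ICU mapping tables (a `.ucm` file), a small parser can turn it into a lookup of covered characters. The sketch below assumes the common UCM layout: mapping lines of the form `<UXXXX> \xNN\xNN |F` between `CHARMAP` and `END CHARMAP`, where a flag of `0` (or no flag) marks a round-trip mapping. Check the ICU User Guide for the authoritative format; `parse_ucm` and `covered` are hypothetical names, and the sample lines in the usage note are illustrative, not copied from the real ibm-943 table.

```python
import re

# One Unicode code point, one or more \xNN bytes, and an optional |flag.
_MAPPING = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*(?:\|(\d))?"
)


def parse_ucm(lines):
    """Return {character: bytes} for the round-trip mappings in a UCM table."""
    table = {}
    in_charmap = False
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line == "CHARMAP":
            in_charmap = True
            continue
        if line == "END CHARMAP":
            in_charmap = False
            continue
        if not in_charmap:
            continue  # header lines such as <code_set_name>
        match = _MAPPING.match(line)
        if match is None:
            continue  # blank lines or layouts this sketch does not handle
        code_point, byte_text, flag = match.groups()
        if flag not in (None, "0"):
            continue  # skip fallback and other one-way mappings
        table[chr(int(code_point, 16))] = bytes.fromhex(byte_text.replace("\\x", ""))
    return table


def covered(text, table):
    """True if every character of text has a round-trip mapping in the table."""
    return all(ch in table for ch in text)
```

A usage sketch, with made-up sample lines standing in for the real file:

```python
sample = [
    '<code_set_name> "sample"',
    "CHARMAP",
    "<U0041> \\x41 |0",
    "<U3000> \\x81\\x40 |0",
    "<U00A5> \\x5C |1",
    "END CHARMAP",
]
table = parse_ucm(sample)
print(covered("A\u3000", table))  # both characters have round-trip mappings
```

For a real table you would pass an open file instead of a list, for example `parse_ucm(open(path, encoding="utf-8"))`, where `path` points at whichever `.ucm` file you downloaded.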