Tags: unicode, character-encoding, shift-jis, gb2312

EUC-JP or GB18030 Text File


I have a text file with the following contents: Ã(195) Ü(220) Â(194) ë(235) Ó(211) Ã(195) »(187) §(167) Ã(195) û(251) Ã(195) Ü(220) Â(194) ë(235) Ã(195) û(251) ³(179) Æ(198) Ã(195) û(251) ³(179) Æ(198). For simplicity, alongside each character I have added the Unicode values that I got from http://www.fileformat.info/.

Going by these values, the file seems to comply with the rule "A character from JIS X 0208 (code set 1) is represented by two bytes, both in the range 0xA1–0xFE" from https://en.wikipedia.org/wiki/Extended_Unix_Code#EUC-JP, and my rendering engine displays Japanese characters. However, this is actually a Chinese text file containing 密码用户名密码名称名称, which Notepad++ recognizes as a GB2312-encoded file. Are there additional restrictions for deciding whether a file is JIS X 0208 (EUC-JP) encoded, given that this one seems to comply with what the Wikipedia article says?

To clarify: my rendering engine recognizes this file as both EUC-JP and Chinese (GB2312), but since EUC-JP comes higher in our detection order, we treat it as Japanese and display Japanese characters.
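
To make the ambiguity concrete, here is a minimal Python sketch (the bytes are simply the values listed above, written out as a byte string) showing that the very same data decodes cleanly under both encodings:

```python
# The 22 bytes listed above (each pair is one double-byte character).
data = bytes([
    0xC3, 0xDC, 0xC2, 0xEB, 0xD3, 0xC3, 0xBB, 0xA7, 0xC3, 0xFB,
    0xC3, 0xDC, 0xC2, 0xEB, 0xC3, 0xFB, 0xB3, 0xC6, 0xC3, 0xFB,
    0xB3, 0xC6,
])

print(data.decode("gb2312"))  # the intended Chinese text: 密码用户名密码名称名称
print(data.decode("euc_jp"))  # also decodes without error, but as unrelated Japanese kanji
```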


Solution

  • There is no completely reliable way to identify an unknown encoding.

    Distribution patterns can probably help you determine whether you are looking at an 8-bit or a 16-bit encoding. Double-byte encodings tend to have a slightly constrained distribution pattern for every other byte. This is where you are now.

    Among 16-bit encodings, you can probably also determine fairly easily whether you are looking at a big-endian or a little-endian encoding. Little-endian will have the constrained pattern on the even bytes, while big-endian will have it on the odd bytes. Unfortunately, most double-byte encodings seem to be big-endian, so this is not going to help much. If you are looking at little-endian data, it is most likely UTF-16LE. (The distribution and endianness checks are sketched at the end of this answer.)

    Looking at your example data, every other byte seems to be equal to or close to 0xC3, starting at the first byte (but there seem to be some bytes missing, perhaps?)

    There are individual byte sequences which are invalid in individual encodings, but on the whole, this is rather unlikely to help you reach a conclusion. If you can eliminate one or more candidate 16-bit encodings with this tactic (a small elimination sketch is included at the end of this answer), good for you; but it will probably not be sufficient to solve your problem.

    Within this space, all you have left is statistics. If the text is long enough, you can probably find repeating patterns, or use a frequency table for your candidate encodings to calculate a score for each. Because the Japanese writing system shares a common heritage with Chinese, you will find similarities in their distributions, but also differences. Typologically, the Japanese language is quite different from Chinese: Japanese running text has grammatical particles every few characters, whereas Chinese does not have them at all. So you would look for "no" の, "wa" は, "ka" か, "ga" が, "ni" に, etc., and if they are present, conclude that you are looking at Japanese (or, conversely, surmise that you are perhaps looking at Chinese if they are absent; though if you are looking at lists of names, for example, it could still be Japanese). A naive scoring sketch for this is included at the end of this answer.

    Within Chinese (and also tangentially for Japanese) you can look at http://www.zein.se/patrick/3000char.html for frequency information; but keep in mind that the Japanese particles will be much more common in Japanese running text than any of these glyphs.

    For example, 的 (the first item on the list) aka U+7684 will be 0x76 0x84 in UTF-16be, 0xAA 0xBA in Big-5, 0xC5 0xAA in EUC-JP, 0xB5 0xC4 in GB2312, etc.

    From your sample data, you likely have item 139 on that list, 名 aka U+540D, which is 0x54 0x0D in UTF-16be, 0xA5 0x57 in Big-5, 0xCC 0xBE in EUC-JP, and 0xC3 0xFB in GB2312. (Do you see it? 0xC3 0xFB occurs repeatedly in your data: a hit!) The last sketch below shows how to verify these byte values directly.
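
    The sketches below illustrate these heuristics in Python. The function names and thresholds are my own, not an established API; treat them as starting points rather than a finished detector. First, the distribution and endianness check:

    ```python
    def byte_profile(data: bytes) -> dict:
        """Summarize where high bytes (>= 0x80) and NUL bytes fall.

        Legacy double-byte encodings such as EUC-JP or GB2312 put bytes in
        the 0xA1-0xFE range at *both* positions of a pair, whereas UTF-16
        text that is mostly ASCII has a NUL in every other byte; which side
        the NULs cluster on tells you the endianness.
        """
        high_even = sum(1 for i in range(0, len(data), 2) if data[i] >= 0x80)
        high_odd = sum(1 for i in range(1, len(data), 2) if data[i] >= 0x80)
        nul_even = sum(1 for i in range(0, len(data), 2) if data[i] == 0)
        nul_odd = sum(1 for i in range(1, len(data), 2) if data[i] == 0)
        return {"high_even": high_even, "high_odd": high_odd,
                "nul_even": nul_even, "nul_odd": nul_odd}
    ```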
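
    Next, eliminating candidates whose byte sequences are outright invalid. As noted above, this only removes options; an encoding that decodes without error is not thereby confirmed:

    ```python
    def viable_encodings(data: bytes,
                         candidates=("euc_jp", "gb2312", "big5",
                                     "utf-16-le", "utf-16-be")) -> list:
        """Return the candidate encodings under which the data decodes cleanly."""
        viable = []
        for enc in candidates:
            try:
                data.decode(enc)
                viable.append(enc)
            except UnicodeDecodeError:
                pass
        return viable
    ```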
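
    Next, the naive particle scoring. Counting raw byte substrings is deliberately crude (a match could in principle straddle a character boundary), but for running text it gives a usable signal:

    ```python
    def particle_score(data: bytes, encoding: str = "euc_jp") -> int:
        """Count occurrences of common Japanese particles under a candidate encoding.

        Running Japanese text contains の, は, が, に, か every few
        characters; Chinese text essentially never does, so a non-trivial
        score suggests the data really is Japanese in that encoding.
        """
        particles = "のはがにか"
        return sum(data.count(p.encode(encoding)) for p in particles)
    ```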
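
    Finally, the per-encoding byte values quoted above are easy to double-check against Python's standard codecs:

    ```python
    for ch in "的名":
        for enc in ("utf-16-be", "big5", "euc_jp", "gb2312"):
            # Compare the printed hex with the byte values quoted above.
            print(ch, enc, ch.encode(enc).hex(" "))
    ```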