
What character encoding is this?


I'm interfacing with an Oracle DB that has a messed-up encoding (ASCII7 according to the DB properties, but it actually stores Korean characters).

When I get some of the Korean strings from the resultSet, and look at the bytes, it turns out that they correspond exactly to this file (I found by googling some of the byte sequences): http://211.115.85.9/files/raw3.txt

Kinda spooky, as it seems to be the ONLY thing on the internet that has anything about this particular encoding...

The file, when viewed with EditPlus3, shows me 3 columns.

The first column is an alphabetical listing of Korean characters. The second is the strange encoding I see when I look at the Java strings returned from the Oracle DB. The third one is UTF-8.

I'm trying to figure out what the middle column is encoded in. Can anyone point me in the right direction?

(I really don't want to have to actually read from this file every time I need to call a DB...)


Solution

  • It is EUC-KR (or similarly) encoded data that was misinterpreted as a single-byte encoding (ISO-8859-1 or similar) and then encoded as UTF-8.

    In other words: it's ill-encoded data, but it might be salvageable:

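    // The bytes as found in the file: UTF-8 for U+00B0 U+00A1
    byte[] bytes = new byte[] { (byte) 0xc2, (byte) 0xb0, (byte) 0xc2, (byte) 0xa1 };
    // Undo the UTF-8 layer
    String str = new String(bytes, "UTF-8");
    // Recover the original single bytes 0xB0 0xA1
    bytes = str.getBytes("ISO-8859-1");
    // Decode those bytes as what they really are: EUC-KR
    str = new String(bytes, "EUC-KR");
    System.out.println(str);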
    

    This prints 가 on my system.

    I've found this PDF file, which explains the problem (and how it happened) in more detail.
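
To avoid looking things up in that file on every DB call, the same round trip can be wrapped in a small helper. This is only a sketch under assumptions: the class and method names (`MojibakeRepair`, `repair`, `garbleToUtf8Bytes`) are made up for illustration, and it assumes the JDBC driver hands back each stored byte as the ISO-8859-1 character with the same value, and that the real encoding is EUC-KR. If either assumption is off (for example, the data is really CP949), the result will be wrong for some characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Assumption: the stored bytes are really EUC-KR.
    private static final Charset EUC_KR = Charset.forName("EUC-KR");

    /** Hypothetical helper: re-reads a string that was decoded as ISO-8859-1 as EUC-KR. */
    public static String repair(String garbled) {
        if (garbled == null) {
            return null;
        }
        // Each char maps back to the single byte it came from.
        // Chars above U+00FF cannot be mapped and become '?'.
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, EUC_KR);
    }

    /** Reproduces the corruption: EUC-KR bytes, misread as ISO-8859-1, written out as UTF-8. */
    public static byte[] garbleToUtf8Bytes(String korean) {
        byte[] eucKr = korean.getBytes(EUC_KR);
        String misread = new String(eucKr, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00b0\u00a1" is what a Java string holding the bytes 0xB0 0xA1 looks like
        // under the ISO-8859-1 assumption.
        System.out.println(repair("\u00b0\u00a1"));

        // Show how the middle-column bytes arise from the Korean text.
        for (byte b : garbleToUtf8Bytes("\uac00")) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }
}
```

In use, `repair(resultSet.getString(...))` would be called on each affected column value. Fixing the character set configuration on the DB or connection side would be the cleaner option if it is available; this helper only works around the symptom.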