Search code examples
javamysqljdbccharacter-encodingmalformed

How to use Java to encode/decode/transcode byte array using a charset directly


I have a malformed string which may be caused by a bug of MySQL JDBC driver,

The bytes of a sample malformed string (malformed_string.getBytes("UTF-8")) is this:

C3 A4 C2 B8 C2 AD C3 A6 E2 80 93 E2 80 A1  (UTF-8 twice)

which should encoded the following bytes (it's already UTF-8 encoded, but treat them as ISO-8859-1 enoded)

----- ----- ----- ----- -------- --------
  E4   B8    AD     E6     96       87     (UTF-8)

which should encoded the following Unicode BigEndian bytes

---------------     ---------------------
     4E 2D                65 87            (Unicode BigEndian)

I want to decode the 1st one to the 2nd one, I tried new String(malformed_string.getBytes("UTF-8"), "ISO-8859-1"), but it does not transcode as expected. I'm wondering if there's something like byte[] encode/decode (byte[] src, String charsetName), or how to achieve the transcode above in java?

Background:

I have a MySQL table which have Chinese column names, when I update such columns with long data, MySQL JDBC driver thrown an exception like this:

com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column '中文' at row 1

The column name in the exception is malformed, it should be "中文", and it must be correctly displayed to user as the following.

com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column '中文' at row 1

EDIT

Here's MySQL statement to demonstrate how the malformed string occured, and how to restore it to correct string

show variables like 'char%';
+--------------------------+--------------------------+
| Variable_name            | Value                    |
+--------------------------+--------------------------+
| character_set_client     | utf8                     |
| character_set_connection | utf8                     |
| character_set_database   | utf8                     |
| character_set_filesystem | binary                   |
| character_set_results    | utf8                     |
| character_set_server     | utf8                     |
| character_set_system     | utf8                     |
| character_sets_dir       | C:\mysql\share\charsets\ |
+--------------------------+--------------------------+

-- encode
select
    hex(convert(convert(unhex('E4B8ADE69687') using UTF8) using ucs2)) as `hex(src in UNICODE)`,
    unhex('E4B8ADE69687') `src in UTF8`,
    'E4B8ADE69687' `hex(src in UTF8)`,
    hex(convert(convert(unhex('E4B8ADE69687') using latin1) using UTF8)) as `hex(src in UTF8->Latin1->UTF8)`;
+---------------------+-------------+------------------+--------------------------------+
| hex(src in UNICODE) | src in UTF8 | hex(src in UTF8) | hex(src in UTF8->Latin1->UTF8) |
+---------------------+-------------+------------------+--------------------------------+
| 4E2D6587            | 中文        | E4B8ADE69687     | C3A4C2B8C2ADC3A6E28093E280A1   |
+---------------------+-------------+------------------+--------------------------------+
1 row in set (0.00 sec)


-- decode
select
    unhex('C3A4C2B8C2ADC3A6E28093E280A1') as `malformed`,
    'C3A4C2B8C2ADC3A6E28093E280A1' as `hex(malformed)`,
    hex(convert(convert(unhex('C3A4C2B8C2ADC3A6E28093E280A1') using utf8) using latin1)) as `hex(malformed->UTF8->Latin1)`,
    convert(convert(convert(convert(unhex('C3A4C2B8C2ADC3A6E28093E280A1') using utf8) using latin1) using binary)using utf8) `malformed->UTF8->Latin1->binary->UTF8`;
+----------------+------------------------------+------------------------------+---------------------------------------+
| malformed      | hex(malformed)               | hex(malformed->UTF8->Latin1) | malformed->UTF8->Latin1->binary->UTF8 |
+----------------+------------------------------+------------------------------+---------------------------------------+
| 中文         | C3A4C2B8C2ADC3A6E28093E280A1 | E4B8ADE69687                 | 中文                                  |
+----------------+------------------------------+------------------------------+---------------------------------------+
1 row in set (0.00 sec)

Solution

  • Check out this tutorial: http://download.oracle.com/javase/tutorial/i18n/text/string.html

    The punch line is the way to transcode using only the jdk classes is :

    try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();
    
    String roundTrip = new String(utf8Bytes, "ISO-885-9");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();
    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
    } catch (UnsupportedEncodingException e) {
     e.printStackTrace();
    }
    

    For a more robust transcoding mechanism I suggest you check out ICU>