mysql ruby-on-rails utf-8 character-encoding latin1

Strange delimited hex in MySQL - need to convert to UTF8


OK. So I have a large legacy database backing a high-traffic website. The tables are latin1-encoded, and I'm in the process of converting them to UTF-8. We have converted the site to Rails and are starting to access the DB directly. However, something very strange seems to happen to UTF-8 characters inserted into the database. We are using Tolk (https://github.com/dhh/tolk) to translate the site into Chinese, and unfortunately the site was set up before the translations table was converted to UTF-8. The problem is that Unicode characters end up in the latin1 table in a strange format.

Here is an example:

--- "xfire\xE7\x94\xA8\xE6\x88\xB7\xEF\xBC\x9F\xE8\xAF\xB7\xE7\x82\xB9\xE5\x87\xBB<a dialog-name='account_actions' href='#login' class='dialog_link login add_overlay'>Sign in</a>\xE7\xBC\x96\xE8\xBE\x91\xE4\xBD\xA0\xE7\x9A\x84\xE8\xB4\xA6\xE6\x88\xB7\xE4\xBF\xA1\xE6\x81\xAF"

The data is serialized as YAML, and either Rails or the database seems to be converting the Chinese characters into this backslash-delimited hex format.

Any ideas what might be going on? Is there a way to translate these hex strings into the corresponding UTF-8 characters?


Solution

  • It turns out that the issue was with YAML (see "Rails: encoding woes with serialized hashes despite UTF8").

    Adding this to environment.rb totally solved the problem:

    YAML::ENGINE.yamler = 'syck' if defined?(YAML::ENGINE)