utf-8 character-encoding windows-1252

Confused about conversion between windows-1252 and UTF-8 encoding


I have a legacy database that claims to have collation set to windows-1252 and is storing a text field's contents as

I’d

When the legacy web app displays it, the browser shows I’d and reports a page encoding of UTF-8. I can't figure out how that conversion is being done (I'm almost certain it isn't an on-the-fly search-and-replace). This is a problem for me because I am moving this text field (and many others like it) from the legacy database into a new UTF-8 database. A new web app displays the text from the new database as

I’d

and I would like it to show I’d. I can't figure out how the legacy app achieves this (some fiddling in Ruby hasn't shown me a way to convert the string I’d to I’d).

I've tied myself in a knot here somewhere.


Solution

  • It probably means the previous developer screwed up data insertion (or you're screwing up somewhere). The scenario goes like this:

    • the database connection is set to latin1
    • app actually sends UTF-8 to database
    • database interprets received data as latin1, stores it as such (interprets the three UTF-8 bytes of ’ as the three characters ’)
    • app queries for the data again
    • database returns ’ encoded in latin1, which is the original byte sequence
    • app interprets the data as UTF-8, resulting in ’
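The round trip in the list above can be reproduced in a few lines of Ruby. This is only a sketch: `store_via_latin1` and `read_via_latin1` are hypothetical names that model what the database connection does, and it assumes the database's "latin1" behaves like Windows-1252 (plain ISO-8859-1 has no € or ™).

```ruby
# Hypothetical model of the insert path: the app sends UTF-8 bytes over a
# connection declared as latin1, so every byte is stored as its own character.
# Assumption: "latin1" behaves like Windows-1252 here.
def store_via_latin1(utf8_text)
  utf8_text.dup.force_encoding("Windows-1252").encode("UTF-8")
end

# Hypothetical model of the legacy read path: the database converts the stored
# characters back to single bytes, and the app treats those bytes as UTF-8.
def read_via_latin1(stored_text)
  stored_text.encode("Windows-1252").force_encoding("UTF-8")
end

intended    = "I\u2019d"                  # I + right single quote + d
stored      = store_via_latin1(intended)  # what the table really holds
legacy_view = read_via_latin1(stored)     # what the legacy app displays

puts stored       # the mojibake form, five characters long
puts legacy_view  # the intended text, three characters long
```

A new app that reads through a utf8 connection gets `stored` as-is, which is why it shows the five-character form.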

    You essentially need to do the same misinterpretation to get good data. Right now you may be querying the database through a utf8 connection, so the database returns ’ encoded in UTF-8. What you need to do is query through a latin1 connection and interpret the data as UTF-8 instead.
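If you have already pulled the text through a utf8 connection (for example during the migration), the same misinterpretation can be applied in Ruby. The helper below is a minimal sketch, not a drop-in fix: `repair_mojibake` is a hypothetical name, and it assumes the damage is exactly one round of UTF-8 read as Windows-1252. It leaves a string untouched when the result would not be valid UTF-8 or when a character has no Windows-1252 byte.

```ruby
# Hypothetical helper: undo one round of "UTF-8 bytes read as Windows-1252".
def repair_mojibake(text)
  # Turn each character back into the single byte it was stored from...
  bytes = text.encode("Windows-1252")
  # ...then read those bytes as the UTF-8 they were all along.
  candidate = bytes.force_encoding("UTF-8")
  # Invalid UTF-8 means the text was probably never double-encoded.
  candidate.valid_encoding? ? candidate : text
rescue Encoding::UndefinedConversionError
  # Some character has no Windows-1252 byte, so this is not that pattern.
  text
end

puts repair_mojibake("I\u00E2\u20AC\u2122d")  # damaged input, gets repaired
puts repair_mojibake("caf\u00E9")             # clean input, left alone
```

Run it against a copy of the data first. Some databases map a few bytes differently from Ruby's Windows-1252 table, so certain characters may come back unrepaired and need a separate pass.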

    See Handling Unicode Front To Back In A Web App for a more detailed explanation of all this.