character-encoding, cross-platform, mojibake

Character Encoding and the ’ Issue


Even today, character encoding problems still show up frequently. Take for example this recent job post:

Bad Encoding

(Note: This is an example, not a spam job post... :-)

I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN.

My two-part question:

  • What causes this particular, common encoding issue?
  • As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.

Solution

  • What causes this particular, common encoding issue?

    This occurs when the conversion between characters and bytes has taken place using the wrong charset. Computers handle data as bytes, but to present the data in a form that makes sense to humans, it has to be converted to characters (strings). This conversion is based on a charset, and there are many different charsets.

    In the particular ’ example, this is a typical CP1252 representation of the Unicode character 'RIGHT SINGLE QUOTATION MARK' (U+2019), which was encoded using UTF-8. In UTF-8, that character consists of the bytes 0xE2, 0x80 and 0x99. If you check the CP1252 codepage layout, you'll see that those bytes represent exactly the characters â, € and ™.
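To make the byte-level explanation concrete, here is a minimal Python sketch that reproduces the garbling. The names `QUOTE`, `garble` and `repair` are hypothetical helpers introduced only for illustration.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, written as an escape so that this
# source file itself cannot be garbled.
QUOTE = "\u2019"


def garble(text: str) -> str:
    """Encode as UTF-8, then decode the same bytes with the wrong charset."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Undo garble(): recover the original bytes, then decode them as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


utf8_bytes = QUOTE.encode("utf-8")   # the three bytes 0xE2 0x80 0x99
broken = garble(QUOTE)               # "\u00e2\u20ac\u2122", displayed as ’
fixed = repair(broken)               # back to "\u2019"
```

The round trip in `repair` only works while every garbled byte maps to a character in CP1252. A few byte values (such as 0x81 and 0x9D) are undefined in Python's `cp1252` codec and would raise an error, so this is a diagnostic aid rather than a general fix.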

    This can be caused by the website not having read in the original source properly (it should have used UTF-8 for this), or by it serving a UTF-8 page with a wrong charset=CP1252 attribute in the Content-Type response header (or with the attribute missing; on Windows machines the default charset of CP1252 would then be used).
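The effect of the Content-Type charset attribute can be sketched as follows. This is an illustration under stated assumptions, not how any particular browser works: `decode_body` is a hypothetical helper, and the CP1252 fallback stands in for the Windows default mentioned above.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str, fallback: str = "cp1252") -> str:
    """Decode a response body using the charset declared in Content-Type.

    If the header carries no charset attribute, fall back to an assumed
    platform default (CP1252 here, purely for illustration).
    """
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset() or fallback
    return body.decode(charset)


body = "\u2019".encode("utf-8")  # a UTF-8 page containing U+2019

correct = decode_body(body, "text/html; charset=UTF-8")         # "\u2019"
mislabeled = decode_body(body, "text/html; charset=CP1252")     # garbled
unlabeled = decode_body(body, "text/html")                      # garbled via fallback
```

Both the mislabeled and the unlabeled case produce the same three garbled characters, which matches the two causes described above.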


  • As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.

    Ensure that you read the characters from arbitrary byte stream sources (e.g. a file, a URL, a network socket, etc.) using a known and predefined charset. Then ensure that you consistently store, write and send them using a Unicode charset, preferably UTF-8.
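As a rough sketch of that advice in Python: name the charset explicitly whenever bytes are turned into characters, and always write UTF-8. The function name `convert_file_to_utf8` and its parameters are hypothetical, and the source charset is assumed to be known to the caller.

```python
from pathlib import Path


def convert_file_to_utf8(src: Path, dst: Path, source_charset: str) -> None:
    """Read src with an explicitly named charset and write dst as UTF-8.

    Nothing is left to the platform default: the input charset is passed
    in by the caller and the output charset is always UTF-8.
    """
    # bytes -> characters, using the known charset of the source
    text = src.read_text(encoding=source_charset)
    # characters -> bytes, always UTF-8 on the way out
    dst.write_text(text, encoding="utf-8")
```

For example, `convert_file_to_utf8(Path("legacy.txt"), Path("legacy-utf8.txt"), "cp1252")` would turn the single CP1252 byte 0x92 into the UTF-8 bytes 0xE2 0x80 0x99. The same rule applies to sockets and HTTP bodies: decode with the declared charset, encode with UTF-8.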

    If you're familiar with Java (your question history confirms this), you may find this article useful.