jquery unicode html-entities cp1252

Unicode entities displayed as CP1252


I've decided to write myself a little script for a Unicode reference, since my favourite on-line Unicode look-up site has become buggy and full of ads. It's been an enjoyable project so far. I've noticed, however, that some characters are displayed incorrectly.

For example, code point U+008E should be a control character called "SINGLE SHIFT TWO", and in fact that's the name that gets displayed, but the character itself shows up as Ž. That's the character which should be at U+017D, "LATIN CAPITAL LETTER Z WITH CARON". It's also the CP1252 character at 0x8E, so that must be a clue to the source of the confusion.

Why is my browser generating and displaying a character in CP1252 encoding, and how can I stop it? Currently the script is running locally on my Mac. It's JavaScript, mostly jQuery, in HTML5: the characters themselves are expressed as numeric entities, e.g. "&#x8e;", and inserted using jQuery's append(). The script itself is encoded in UTF-8, and the HTML specifies UTF-8 in a meta tag. Is it an Apache issue? An OS issue? I haven't done extensive browser testing, but it's the same in Safari, Firefox, Opera and Chrome, so I guess it's not a browser bug.

I could simply remove all control characters since they're not meant to display anyway. I'm currently ajaxing the character names into the page from an XML file containing information on all Unicode characters, so while I'm doing that I could check whether or not a character is a control character and remove it accordingly. But the XML is huge and the Ajax is slow enough to make it confusing as a quick reference, so I'd really like to find a way of just forcing my computer not to show me rubbish in the first place.

Any ideas?


Solution

  • This is the browser working around buggy websites.

    For example: the bullet (U+2022) is encoded as byte 0x95 in a few single-byte character sets, like Windows-1252. As a consequence, some people would include a bullet in their web page by writing &#x95;, which presumably used to work if the browser was using the same encoding.

    Strictly speaking, &#x95; refers to a control character (U+0095). But since those control characters are not normally used in web pages, even modern browsers assume that a numeric entity in the range 0x80 to 0x9F refers to the corresponding Windows-1252 byte, and display a bullet. (The correct numeric entity for a bullet is &#x2022;.)

    These days you would usually specify the encoding of your page (often UTF-8) and just write the bullet character literally in the HTML page.

    This is also the way to stop this behaviour: insert the characters themselves (e.g. with $element.text("•")) instead of numeric entities.
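
For a reference script that starts from code point numbers, the advice above could be sketched as follows. The helper names and the `$cell` variable are hypothetical, and the remapping table is only a small subset of what the HTML parser applies to numeric references in the range 0x80 to 0x9F. The sketch contrasts what the parser produces for a reference such as `&#x8e;` with the character obtained by building the string directly.

```javascript
// Subset of the Windows-1252 remapping that HTML parsers apply to
// numeric character references in the range 0x80-0x9F (illustrative only).
const CP1252_REMAP = new Map([
  [0x80, 0x20ac], // EURO SIGN
  [0x8e, 0x017d], // LATIN CAPITAL LETTER Z WITH CARON
  [0x95, 0x2022], // BULLET
  [0x99, 0x2122], // TRADE MARK SIGN
]);

// Hypothetical helper: the character the HTML parser yields for "&#xNN;".
function parsedReference(codePoint) {
  const mapped = CP1252_REMAP.has(codePoint)
    ? CP1252_REMAP.get(codePoint)
    : codePoint;
  return String.fromCodePoint(mapped);
}

// Hypothetical helper: the actual character, with no HTML parsing involved.
function literalCharacter(codePoint) {
  return String.fromCodePoint(codePoint);
}

// In the page, insert the literal character as text rather than as markup:
//   $cell.text(literalCharacter(0x8e));   // stays U+008E
// instead of:
//   $cell.append("&#x8e;");               // parser turns it into U+017D
```

Because `.text()` never runs the string through the HTML parser, no remapping takes place. A C1 control character inserted this way will typically render as nothing or as a placeholder glyph, depending on the font.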