google-chrome unicode utf-8 character-encoding windows-1252

Chrome form POST shows "(unable to decode value)" and database stores it as a question mark

I have a test site and test DB both set to windows-1252. When I type Alt+234 into Chrome it puts this symbol in the field: Ω. And when I submit the form it posts and stores it as Ω I'm assuming this is the browser saying "hey, this isn't in the specified charset but I do know of an html equivalent, so I'll post that instead". Fine. The symbol appears properly after saving, I can save, save, save, and it always appears fine. But if I try the same thing with Alt+230 the browser does not submit it's html entity value of µ. Instead I see "(unable to decode value)" when viewing the POST in the Chrome DevTool window. And it ends up being stored in the database as a question mark.

Why does it treat Alt+234 (Ω) differently than Alt+230 (µ)?

I know I should switch to UTF8 but I still would like to know why it is functioning this way. Thanks!

Solution

U+03A9 Ω Greek capital letter omega is not part of Windows code page 1252.

U+00B5 µ Micro sign (which is not the exact same character as Greek mu) is part of 1252 (byte 181).

The Alt+keypad shortcut numbers don't align with code page 1252, or the current ANSI code page in general, so being able to type a character from that shortcut doesn't imply membership of those code pages. Instead they are from DOS code page 437.

And when I submit the form it posts and stores it as Ω I'm assuming this is the browser saying "hey, this isn't in the specified charset but I do know of an html equivalent, so I'll post that instead"

Yes, this is a long-standing weird unrecoverable mangling that HTML5 finally standardised, for when a character is not encodable in the encoding the page has requested.

Instead I see "(unable to decode value)" when viewing the POST in the Chrome DevTool window. And it ends up being stored in the database as a question mark.

The browser will be sending that character as code page 1252 byte 181. The devtools and whatever your application is aren't expecting to be dealing with code page 1252 bytes... probably they are expecting UTF-8. Because byte 181 on its own is not a valid UTF-8 sequence they can't keep it.