When is a character encoding actually applied?


Many resources exist describing character encoding best practices and bit sequences, but without an accurate map of the content’s journey, I’m struggling to understand and apply them.

  • I know how to tell my code editor to save files in UTF-8.
  • I know how to include a character encoding meta tag in my HTML.
  • I know how to declare character encoding for a form.

But my mental model is missing so many steps!

I’ve included a diagram to illustrate. Purple is the server; red is the browser; green is the OS (Windows XP in the diagram, but could be anything).

[Diagram: The server sends data (1) to a browser (2) running on an operating system, which renders a webpage containing a form (3). The form has an input (4), into which an em dash and a right single quotation mark have been entered along with regular ASCII characters, above a submit button (5) that sends the data back to the server (6).]

  1. What does PHP send in its response (in the body)?
    • Does it send exactly what it received from my code editor and assume that the echoed characters will be valid?
    • Does it echo using the encoding I told it I wrote my file in?
    • Does it encode in a standardized HTTP encoding?
  2. When the client’s browser receives server data, does it...
    • Scan the response headers for a character encoding value?
    • Assume a standardized HTTP encoding until it reaches my meta tag? (And if found, does it re-decode previous content?)
    • Output exactly what it received, relying on the user’s OS to handle encoding?
  3. When exactly is the form’s character encoding applied? (See below)
  4. How is user data entered into the form via their keyboard encoded?
    • OS encoding (as though the browser opened a little door for the OS to enter and display its own data)
    • Browser encoding (storing OS keystrokes in some browser-specific format)
    • Form encoding (translating OS characters to the declared encoding of the form)
    • HTML document encoding (translating OS characters to the encoding in the meta tag)
  5. What does the browser post to the server?
    • Unmodified user data (depends on #4, but probably the original OS encoding)
    • User data encoded in the form’s declared encoding
    • User data encoded in the HTML meta tag’s encoding
    • User data in a standardized HTTP encoding
  6. When the server reads the data back into PHP, is it...
    • Decoded from a standardized HTTP encoding into PHP’s runtime encoding
    • Decoded from an encoding declared in the request headers
    • Unmodified user data (relying wholly on the developer to handle any conflicts)

Solution

  • I think an important piece your mental model may be missing is the distinction between bytes and characters. At different steps and different levels, text is either treated as opaque, meaningless bytes, or the computer is aware of the text as characters.

    When the computer treats text as characters, it still has to store them as bytes in memory, but that is an implementation detail, and the exact in-memory representation may differ from program to program. The important part is that the computer knows "漢字" is "漢字" and can produce a byte representation of those characters, at any time, in any encoding that supports them.
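
    To make the bytes-versus-characters distinction concrete, here is a minimal PHP sketch. It assumes the mbstring extension is loaded, and the variable names are purely illustrative. The same two characters are written out as bytes in two different encodings:

    ```php
    <?php
    // Assumption: the mbstring extension is loaded. "\u{...}" escapes need PHP 7.0+.
    $text = "\u{6F22}\u{5B57}"; // the two characters 漢字, which PHP stores as UTF-8 bytes

    $utf8  = $text;
    $utf16 = mb_convert_encoding($text, 'UTF-16BE', 'UTF-8');

    echo bin2hex($utf8), "\n";   // e6bca2e5ad97 (6 bytes)
    echo bin2hex($utf16), "\n";  // 6f225b57 (4 bytes)

    echo strlen($utf8), "\n";              // 6: strlen() counts bytes
    echo mb_strlen($utf8, 'UTF-8'), "\n";  // 2: characters, once the encoding is stated
    ```

    The characters are the same in both cases; only their byte representation differs.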

    Browser

    The browser is character-aware: everything that happens inside it treats text as text. When it receives a file from the server, it looks at the HTTP headers (the charset parameter of Content-Type) or other indicators, such as a byte order mark or a meta tag, to figure out what encoding the file is in. It then decodes the bytes from that encoding and treats all text as known, specific characters from then on.
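
    As a rough illustration of why the declared encoding matters, here is a PHP sketch that plays the browser's role. `charsetFromContentType()` is a hypothetical helper, and the windows-1252 fallback is an assumption made for the demo; real browsers follow more elaborate detection rules.

    ```php
    <?php
    // Hypothetical helper: pull the charset parameter out of a Content-Type value.
    function charsetFromContentType(string $contentType, string $fallback = 'windows-1252'): string
    {
        if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)"?/i', $contentType, $m)) {
            return $m[1];
        }
        return $fallback; // assumption: a single fixed fallback, for the demo only
    }

    // These three bytes are an em dash when read as UTF-8.
    $body = "\xE2\x80\x94";

    // Header declares the encoding: the bytes are decoded as intended.
    $declared = charsetFromContentType('text/html; charset=UTF-8');
    $right = mb_convert_encoding($body, 'UTF-8', $declared);

    // No declaration: the fallback is used and the same bytes become three other characters.
    $guessed = charsetFromContentType('text/html');
    $wrong = mb_convert_encoding($body, 'UTF-8', $guessed);

    echo mb_strlen($right, 'UTF-8'), "\n"; // 1
    echo mb_strlen($wrong, 'UTF-8'), "\n"; // 3
    ```

    The bytes on the wire are identical in both cases; only the interpretation changes.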

    When entering text into a form, the OS takes care of the underlying details, including receiving key codes from the keyboard, mapping those through the chosen keyboard layout, perhaps involving an IME for text transformation (e.g. to enter 日本語), and provides the browser with characters.

    When it comes time to send those characters to the server, the browser determines which encoding to use, based on factors such as the form's accept-charset attribute, falling back to the encoding it determined for the document. It then represents the text as bytes in that encoding. At this point, any character the target encoding cannot represent may be replaced by an HTML numeric character reference (for example, &#8212; for an em dash). The browser may then apply a transport encoding such as URL percent-encoding to those bytes. The result is what gets sent to the server.
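
    The submission steps can be approximated in PHP. This is only a sketch of the idea, not the browser's actual algorithm: `encodeFormValue()` is a hypothetical helper, it needs PHP 7.4+ with mbstring, and ISO-8859-5 is chosen simply because it cannot represent the two punctuation characters from the question.

    ```php
    <?php
    // Hypothetical helper mimicking what a browser does with one form value.
    function encodeFormValue(string $utf8Text, string $targetCharset): string
    {
        $bytes = '';
        foreach (mb_str_split($utf8Text, 1, 'UTF-8') as $char) {
            // Step 1: represent the character as bytes in the target encoding.
            $encoded   = mb_convert_encoding($char, $targetCharset, 'UTF-8');
            $roundTrip = mb_convert_encoding($encoded, 'UTF-8', $targetCharset);
            if ($roundTrip === $char) {
                $bytes .= $encoded;
            } else {
                // Step 2: not representable, so fall back to a numeric character reference.
                $bytes .= '&#' . mb_ord($char, 'UTF-8') . ';';
            }
        }
        // Step 3: apply the transport encoding (URL percent-encoding).
        return urlencode($bytes);
    }

    // Right single quotation mark (U+2019) and em dash (U+2014), as in the question.
    $input = "it\u{2019}s \u{2014} ok";

    echo encodeFormValue($input, 'UTF-8'), "\n";
    // it%E2%80%99s+%E2%80%94+ok
    echo encodeFormValue($input, 'ISO-8859-5'), "\n";
    // it%26%238217%3Bs+%26%238212%3B+ok
    ```

    In the second call the server receives the literal text `&#8217;` and `&#8212;`, which is why declaring UTF-8 for the form avoids surprises.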

    PHP

    By default, PHP doesn't do anything with encodings. Its strings are not text-aware; they are just sequences of bytes with no meaning attached. So your code has to know what encoding any received text is in and treat it accordingly. PHP will decode URL percent-encoding when populating $_GET and $_POST, but these variables then contain only the transport-decoded bytes, not text.
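
    A small sketch of what that means in practice, assuming mbstring is available. The `$raw` string stands in for a percent-encoded value from a request; in a real script PHP would already have decoded it into $_POST for you.

    ```php
    <?php
    // Assumption: this stands in for a raw, percent-encoded value from a request body.
    $raw = 'caf%C3%A9';

    // Transport decoding only: the result is 5 bytes, not "4 characters".
    $value = urldecode($raw);

    echo strlen($value), "\n";             // 5 (bytes)
    echo mb_strlen($value, 'UTF-8'), "\n"; // 4 (characters, because we state the encoding)

    // PHP verified nothing, so check the assumption explicitly.
    $isUtf8 = mb_check_encoding($value, 'UTF-8');               // true
    $isBad  = mb_check_encoding(urldecode('caf%E9'), 'UTF-8');  // false: a lone 0xE9 is not valid UTF-8

    var_dump($isUtf8, $isBad);
    ```

    Validating input this way is one option for catching data that arrived in an unexpected encoding before it spreads through the application.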

    Whatever you output from PHP will be output as is. What that is depends on where it came from. Anything that comes from (source code) files on disk depends on how it was saved in the text editor. Anything coming from a database depends on how you established the database connection; databases are generally text-aware and will provide you the text in the encoding you request, which you can configure. It's usually best to ensure everything is in UTF-8 all the way.
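
    For the database part, one common way to request UTF-8 is through the connection settings. The sketch below assumes PDO with the MySQL driver; `utf8Dsn()` is a hypothetical helper, and the host and database names are placeholders.

    ```php
    <?php
    // Hypothetical helper: build a PDO MySQL DSN that requests UTF-8 on the connection.
    function utf8Dsn(string $host, string $dbName): string
    {
        // "charset=utf8mb4" asks the MySQL driver to exchange text as full UTF-8.
        return "mysql:host={$host};dbname={$dbName};charset=utf8mb4";
    }

    $dsn = utf8Dsn('localhost', 'example_db'); // placeholder values
    echo $dsn, "\n"; // mysql:host=localhost;dbname=example_db;charset=utf8mb4

    // With a reachable server and real credentials you would then connect:
    // $pdo = new PDO($dsn, $user, $password);
    ```

    Other databases and drivers have their own way of setting the connection encoding, so check the relevant documentation.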

    PHP and/or the web server should send headers that correctly state the encoding of the content you're outputting (for example, Content-Type: text/html; charset=UTF-8), so the browser can determine it reliably.
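
    From PHP, declaring the encoding might look like the sketch below. `contentTypeHeader()` is a hypothetical helper added only to keep the example easy to follow; `header()` and `headers_sent()` are standard PHP functions.

    ```php
    <?php
    // Hypothetical helper: build the header line that declares the body's encoding.
    function contentTypeHeader(string $charset): string
    {
        return 'Content-Type: text/html; charset=' . $charset;
    }

    $line = contentTypeHeader('UTF-8');

    // header() only has an effect before any body output has been sent.
    if (!headers_sent()) {
        header($line);
    }

    // The bytes echoed here must really be UTF-8 for the declaration to be true.
    echo "<p>\u{2014} and \u{2019}</p>\n";
    ```

    The header is only a label. It does not convert anything, so it must match the bytes you actually output.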