I'm authoring HTML5 documents and was a little surprised that, when neither an HTTP header nor a meta element sets it, the text encoding defaults to windows-1252 in the browsers I tested (Safari, Chrome, Firefox; recent versions as of Feb 2023 on macOS).
In particular, I'm using <!DOCTYPE html> but forgot to add the <meta charset="utf-8"> element. If I open the file locally, browsers perform auto-detection and use utf-8 when non-ASCII characters are present, but not when the file is served through a web server.
I understand that browsers can't simply default to utf-8 for all HTML files because of legacy content, and that auto-detection for content served over HTTP is hard (the reasoning is described here: https://hsivonen.fi/utf-8-detection/).
What I don't understand, however, is why a modern HTML5 document in standards mode (with doctype set) does not also use utf-8 by default?
Edit: The similar question "Why it's necessary to specify the character encoding in an HTML5 document if the default character encoding for HTML5 is UTF-8?" asks why one needs to set the encoding under the (wrong) assumption that utf-8 is the default, not what the default is (or how it's selected).
Through this question (thanks exa.byte and Rob!) and the HTML spec I believe I was able to piece together an answer.
Short answer: No, HTML5 has no default character encoding (but read on).
Long answer: Obviously, browsers have to use some encoding to display the page. When none is specified, the algorithm first tries auto-detection. In my testing, browsers actually do this for local files (URLs starting with file://), and some might even do it for remote files, but the standard encourages not doing this for remote files beyond the first 1024 bytes (which is also where the meta charset tag has to be). The limit is recommended so that parsing isn't stalled for too long. Browsers can also skip the auto-detection step entirely if they want (I believe this is what Firefox does for remote files).
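To illustrate why the position of the meta tag matters, here is a rough Python sketch that looks for a charset declaration only inside the first 1024 bytes. The function name `declared_charset` and the regular expression are my own simplification for illustration; the real prescan in the HTML spec is a proper byte-level parser, not a regex.

```python
import re

PRESCAN_LIMIT = 1024  # bytes inspected before giving up, as discussed above

# Simplified pattern (an assumption, not the spec's prescan algorithm):
# matches <meta charset="..."> and the http-equiv form with "charset=" in content.
META_CHARSET = re.compile(
    rb'<meta[^>]*\bcharset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
    re.IGNORECASE,
)

def declared_charset(html_bytes, limit=PRESCAN_LIMIT):
    """Return the charset declared within the first `limit` bytes, or None."""
    match = META_CHARSET.search(html_bytes[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()

early = b'<!DOCTYPE html><html><head><meta charset="utf-8"><title>x</title>'
late = b'<!DOCTYPE html><html><head><!--' + b" " * 2000 + b'--><meta charset="utf-8">'

print(declared_charset(early))  # declaration is inside the window
print(declared_charset(late))   # declaration comes too late to be seen
```

In this sketch the second document has a perfectly valid declaration, but it sits after the inspected window, so it is treated as if it were missing.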
Side note: "no encoding specified" above means no BOM, no Content-Type header with a charset, no meta tag, no encoding inherited from the parent document of an iframe, and no XML declaration (yes, this is used for text/html too).
So, if auto-detection doesn't settle on an encoding, for example because several candidates remain or the browser didn't have enough data at the time, the browser falls back to an implementation-defined default. This can vary by browser, but the HTML spec suggests utf-8 for controlled environments and a locale-based default otherwise (#9 here).
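The fallback step can be sketched roughly like this in Python. This is a deliberately naive stand-in for real detectors, which are far more elaborate; `guess_encoding` and its `locale_default` parameter are hypothetical names, and the only "detection" done here is checking whether the bytes seen so far are valid, non-ASCII UTF-8.

```python
def guess_encoding(available, locale_default="windows-1252"):
    """Naive detection over the bytes available so far (illustration only)."""
    if available.isascii():
        # Pure ASCII looks the same in utf-8 and windows-1252,
        # so there is nothing to base a decision on.
        return locale_default
    try:
        available.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 (a multi-byte sequence cut off at the end of the
        # buffer would also land here in this simplified version).
        return locale_default
    return "utf-8"

# A page whose only non-ASCII character appears late in the document.
page = b"<!DOCTYPE html><p>" + b"a" * 2000 + "é".encode("utf-8") + b"</p>"

print(guess_encoding(page[:1024]))  # only the beginning was available
print(guess_encoding(page))         # the whole file was available
```

With only the first 1024 bytes the sketch has no evidence and falls back to the locale default; with the whole file (as for a local file) it picks utf-8.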
Finally, to explain why I was getting windows-1252: a) auto-detection failed (the non-ASCII characters were at the end of the page) and b) the browsers I use then selected windows-1252 based on my preferred/selected locale.
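For completeness, this is what the wrong guess looks like in practice. The small Python snippet below (the sample word is just an example I picked) decodes UTF-8 bytes as windows-1252, which is what the browser effectively did with my page:

```python
text = "café"
utf8_bytes = text.encode("utf-8")            # what the file actually contains

as_utf8 = utf8_bytes.decode("utf-8")         # correct interpretation
as_cp1252 = utf8_bytes.decode("windows-1252")  # the locale-based fallback

print(as_utf8)    # café
print(as_cp1252)  # cafÃ©
```

The two bytes that encode "é" in UTF-8 are shown as two separate windows-1252 characters, which is the typical garbled output when the declaration is missing.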