Call me prehistoric, but I'm trying to use an XHTML document type in a UTF-8 encoded HTML page, with a PRE tag containing text that uses the Unicode line separator U+2028 for its line breaks.
Firefox, at least, seems not to honor U+2028 as a line break in a PRE block. Changing the character to U+000D or U+000A produces the line breaks I'm expecting. (Technically, U+2028 is encoded in UTF-8 as a 3-byte sequence, but I assume it is decoded back to a single character when the page is read.) I haven't tested this with other browsers yet.
I tried digging through the W3C docs on HTML, but could not figure out from the section on PRE exactly which characters are treated as line breaks. Where is chapter and verse on what is interpreted as a line break in PRE? Is U+2028 supposed to be treated as one, with Firefox being defective, or is the HTML standard brain dead in not interpreting U+2028 as a line break when found in a Unicode file?
It seems pretty weird to me that a text file (e.g. source code) containing Unicode would not use U+2028 as its standard line break. (I actually have a code generator that produces source code like this, and I'm trying to display that code in an HTML page.) So I would think that placing such code straight into a PRE block would produce the behavior I expect.
Despite what the nature of the PRE element might suggest, its rendering behavior is actually specified in CSS, not in HTML, as it pertains to whitespace rendering.
CSS2 says that U+000D and U+000A count as newlines, and that user agents may recognize and normalize other Unicode characters as such. It doesn't mention U+2028 anywhere, however.
css-text-3 covers whitespace and line break processing much more comprehensively. It defines the term segment break as follows:
For CSS processing, each document language–defined segment break, CRLF sequence (U+000D U+000A), carriage return (U+000D), and line feed (U+000A) in the text is treated as a segment break, which is then interpreted for rendering as specified by the white-space property.
Like CSS2, it doesn't mention U+2028.
But, in a later section, it does mention forced break characters (of which U+2028 is one):
When determining line breaks:
- Regardless of the white-space value, lines always break at each preserved forced break character: for all values, line-breaking behavior defined for the BK, CR, LF, CM, NL, and SG line breaking classes in [UAX14] must be honored.
Note that it even says "Regardless of the white-space value"; this means that even outside of a PRE element, U+2028 must introduce a line break (in a similar fashion to a BR element)!
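To make the characters involved concrete, here is a small Python sketch (Python is only a convenient way to inspect code points here; it is not part of the question). The FORCED_BREAKS table is an assumption on my part: it is hand-copied from my reading of UAX #14 and covers only the BK, CR, LF and NL classes, because the standard library's unicodedata module does not expose line-breaking classes. The describe helper is likewise just an illustrative name.

```python
import unicodedata

# Hand-written table (an assumption, not queried from any library):
# code points in the UAX #14 classes BK, CR, LF and NL.
FORCED_BREAKS = {
    "\u000A": "LF",  # line feed
    "\u000D": "CR",  # carriage return
    "\u0085": "NL",  # next line
    "\u000B": "BK",  # line tabulation
    "\u000C": "BK",  # form feed
    "\u2028": "BK",  # line separator
    "\u2029": "BK",  # paragraph separator
}


def describe(ch):
    """Summarize one character: name, general category, break class, UTF-8 bytes."""
    # Control characters have no Unicode name, so fall back to a placeholder.
    name = unicodedata.name(ch, "<control>")
    category = unicodedata.category(ch)
    utf8 = " ".join("%02X" % b for b in ch.encode("utf-8"))
    return "U+%04X %s: category %s, UAX #14 class %s, UTF-8 bytes %s" % (
        ord(ch), name, category, FORCED_BREAKS[ch], utf8)


for ch in sorted(FORCED_BREAKS):
    print(describe(ch))
```

The line for U+2028 shows the 3-byte UTF-8 sequence mentioned in the question (E2 80 A8) and the general category Zl, which is distinct from the control characters U+000A and U+000D that CSS names explicitly.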
As for implementations, Internet Explorer and Microsoft Edge appear to be the only browsers that render U+2028 as a line break within a PRE element with the default of white-space: pre. The only caveat is that they normalize it to U+000A, so it ends up being treated as regular whitespace outside of the PRE element (or outside of white-space: pre/pre-line). This matches what css-text-3 says about preserved forced breaks, but I'm not sure whether normalizing U+2028 to U+000A is itself acceptable, or a Unicode/CSS spec violation.
Chrome on Windows 10 always prints a symbol labeled LSEP, and Firefox always prints a zero-width character.
Whether the document is application/xhtml+xml or text/html seems to make no difference in any of these cases.
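Given how inconsistent browsers are, a practical workaround is to normalize the separators to U+000A yourself before the text goes into the PRE element. Below is a minimal sketch in Python, assuming the generated source can be post-processed as a string before it is written into the page. The function name to_pre_block is hypothetical, and mapping U+2029 (paragraph separator) to a single newline is my own choice, not something any spec requires.

```python
import html


def to_pre_block(source):
    """Wrap text in a PRE element, using U+000A for every line break."""
    # Collapse CRLF first so each pair becomes exactly one newline.
    text = source.replace("\r\n", "\n")
    # Then map the remaining separators to LF, which browsers
    # render as a line break inside PRE.
    for separator in ("\r", "\u2028", "\u2029"):
        text = text.replace(separator, "\n")
    # Escape markup characters so the source is displayed, not parsed.
    return "<pre>" + html.escape(text) + "</pre>"


print(to_pre_block("int a = 1;\u2028if (a < 2) { a++; }"))
```

This sidesteps the question of which browser is right: U+000A is the one character that both CSS2 and css-text-3 name explicitly.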