
Do information separators constitute line-breaks in Unicode?


This Wikipedia article, which lists all Unicode whitespace characters, marks seven of them as line or paragraph separators (LF, VT, FF, CR, NEL, LS, PS). It says nothing about the ASCII 'information separator' characters (FS, GS, RS, US). Surprisingly, though, FS, GS, and RS have 'paragraph separator (B)' as their bidirectional class, which is confusing.

Now, when I encounter one of these 'information separator' characters in a text, should I treat it as a line break or not? In other words, if I am writing a function that splits at line breaks, should I split at these three characters? (Python's str.splitlines() does treat them as line breaks. I don't know about other implementations.)
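
A quick way to check the Python behaviour mentioned above is the sketch below. The helper name `splits_on` and the `SEPARATORS` table are made up for illustration; only `str.splitlines()` comes from the standard library.

```python
# The four ASCII information separators (U+001C..U+001F).
SEPARATORS = {"FS": "\x1c", "GS": "\x1d", "RS": "\x1e", "US": "\x1f"}

def splits_on(ch):
    """Return True if str.splitlines() treats `ch` as a line boundary."""
    # "a<ch>b" yields two pieces only when `ch` is recognised as a boundary.
    return len(f"a{ch}b".splitlines()) == 2

results = {name: splits_on(ch) for name, ch in SEPARATORS.items()}
print(results)
# {'FS': True, 'GS': True, 'RS': True, 'US': False}
```

So Python splits at FS, GS, and RS, but not at US.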

For example:

  1. Both the linked Wikipedia table and the Unicode bidi class database treat LF as a line break, so I can break the line when I encounter that character.

  2. Neither the linked Wikipedia table nor the Unicode bidi class database treats SP as a line break, so I can't break a line at that character (assuming no word wrap).

  3. The linked Wikipedia table does not list GS as a line break, but the Unicode bidi class database does. I'm confused: what should I do in this case? What does the bidi class refer to here?
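
The bidi classes behind the three cases above can be inspected with Python's standard `unicodedata` module. This is only a sketch for looking up the values; the `CHARS` table and the `bidi` name are illustrative.

```python
import unicodedata

# Characters from the three cases above, plus the other information separators.
CHARS = {"LF": "\n", "SP": " ", "FS": "\x1c", "GS": "\x1d", "RS": "\x1e", "US": "\x1f"}

# unicodedata.bidirectional() returns the Bidi_Class short name:
# "B" = paragraph separator, "S" = segment separator, "WS" = whitespace.
bidi = {name: unicodedata.bidirectional(ch) for name, ch in CHARS.items()}
print(bidi)
# {'LF': 'B', 'SP': 'WS', 'FS': 'B', 'GS': 'B', 'RS': 'B', 'US': 'S'}
```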

Here I'm only asking about the Unicode standard, but if you know, you can also mention how the ASCII standard handles line breaks.

PS: I'm not sure whether the table on the linked Wikipedia page is correct, but I wasn't able to find any other good resource that lists all whitespace characters.


Solution

  • FS, GS, RS, and US belong to the line break class Combining_Mark (CM). The relevant file in the Unicode Character Database for this information is LineBreak.txt.

    UAX #14 (Unicode Line Breaking Algorithm) describes class CM as follows:

    Combining character sequences are treated as units for the purpose of line breaking. The line breaking behavior of the sequence is that of the base character.

    In other words: class CM characters prohibit line breaks before them – they essentially “glue” themselves to the previous character. However, for all other purposes, the line breaking algorithm completely ignores the presence of class CM characters. Whether or not a line break opportunity exists after a class CM character depends solely* on the line break class of the base character it has been applied to, i.e. the first character going backwards that is not of class CM. So under UAX #14, FS, GS, RS, and US do not introduce a line break by themselves.

    *There are some exceptions to this rule involving mandatory breaks and a few special formatting characters, but they shouldn’t be relevant for your purposes.
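
If the goal is a splitting function that follows this reading, one option is to split only at the characters whose UAX #14 classes force a break (BK, CR, LF, NL) and leave FS, GS, RS, and US alone. The sketch below assumes that is the behaviour you want; the function name `split_mandatory_breaks` is hypothetical, and this is not a full implementation of the line breaking algorithm.

```python
import re

# Characters in the mandatory-break classes of UAX #14:
#   BK: VT (U+000B), FF (U+000C), LS (U+2028), PS (U+2029)
#   CR (U+000D), LF (U+000A), NL: NEL (U+0085)
# "\r\n" is listed first so that a CR LF pair counts as a single break.
_MANDATORY_BREAK = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")

def split_mandatory_breaks(text):
    """Split `text` at mandatory line breaks only; information separators stay in place."""
    return _MANDATORY_BREAK.split(text)

print(split_mandatory_breaks("a\x1cb\x1dc\nd\r\ne\u2028f"))
# ['a\x1cb\x1dc', 'd', 'e', 'f']
```

Unlike str.splitlines(), this keeps a trailing empty string when the text ends with a break, because re.split() reports the empty piece after the last separator.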