Every programming language has its own interpretation of \n and \r.
Unicode supports multiple characters that can represent a new line.
From the Rust reference:
A whitespace escape is one of the characters U+006E (n), U+0072 (r), or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or U+0009 (HT) respectively.
Based on that statement, I'd say a Rust character is a new-line character if it is either \n or \r. On Windows it might be the combination of \r and \n. I'm not sure though.
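As a minimal sketch of that working definition (both function names are hypothetical, nothing here comes from the standard library): a single `char` can only ever be CR or LF on its own, so the Windows CR LF pair has to be detected at the string level.

```rust
/// Hypothetical helper: the narrow definition from the paragraph above,
/// where only LF (U+000A) and CR (U+000D) count as new-line characters.
fn is_cr_or_lf(c: char) -> bool {
    matches!(c, '\n' | '\r')
}

/// Hypothetical helper: counts CR LF pairs in a string, since a Windows
/// line ending is two `char`s and cannot be recognised one `char` at a time.
fn count_crlf_pairs(s: &str) -> usize {
    s.matches("\r\n").count()
}
```

For instance, `is_cr_or_lf('\t')` is false, and `count_crlf_pairs("a\r\nb\r\n")` counts two pairs.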
What about the following?
In my opinion, we are missing something like a char.is_new_line().
I looked through the Unicode Character Categories but couldn't find a definition for new-lines.
Do I have to come up with my own definition of what a Unicode new-line character is?
There is considerable practical disagreement between languages like Java, Python, Go and JavaScript as to what constitutes a newline character and how that translates to "new lines". The disagreement shows in how their batteries-included regex engines treat a pattern like $ against a string like \r\r\n\n in multi-line mode: are there two lines (\r\r\n, \n), three lines (\r, \r\n, \n, like Unicode says) or four (\r, \r, \n, \n, like JS sees it)? Go and Python do not treat \r\n as a single $, and neither does Rust's regex crate; Java's does, however. I don't know of any language whose batteries extend newline handling to any more Unicode characters.
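Outside of regex engines, Rust's standard library takes the two-line view: `str::lines` ends a line at \n, strips one \r directly before it, and does not treat a bare \r as a line ending. A small sketch (the wrapper name `std_lines` is only for illustration):

```rust
/// Collects the lines reported by the standard library's `str::lines`.
/// A line ends at LF; a single CR directly before that LF is dropped.
/// A CR that is not followed by LF stays inside the line's content.
fn std_lines(s: &str) -> Vec<&str> {
    s.lines().collect()
}
```

For the string \r\r\n\n this yields two lines: the first still contains one \r, the second is empty.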
So the takeaway here is:
- \n is a newline
- \r\n may be a single newline
- or \r\n is treated as two newlines
- or \r\n is "some character followed by a newline"
If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. Don't expect real-world input to rely on that. After all, we have had the ASCII record separator for a gazillion years, and everybody uses \t instead as well.
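Such a function could look like the sketch below. The name `is_unicode_newline` is hypothetical, and the character set is an assumption: it takes the characters that the Unicode line breaking report classifies as mandatory breaks (classes BK, CR, LF and NL). Adjust the set to whatever your input really needs.

```rust
/// Hypothetical helper, not part of the standard library.
/// Returns true for characters that Unicode treats as mandatory
/// line breaks (an assumption about which set is wanted here).
fn is_unicode_newline(c: char) -> bool {
    matches!(
        c,
        '\u{000A}'   // LF, line feed
        | '\u{000B}' // VT, line tabulation
        | '\u{000C}' // FF, form feed
        | '\u{000D}' // CR, carriage return
        | '\u{0085}' // NEL, next line
        | '\u{2028}' // LS, line separator
        | '\u{2029}' // PS, paragraph separator
    )
}
```

Note that this works per `char`, so it still reports \r\n as two separate newline characters; pairing them up is a job for whatever code splits the text.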
Update: See http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules, section LB5, for why \r\r\n should be treated as two line breaks. You could read the whole page to get a grip on how your original question would have to be implemented. My guess is that by the time you reach "South East Asian: line breaks require morphological analysis" you'll close the tab :-)
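For the CR and LF part of that rule only, a rough sketch might look like this. The function name `split_lines_crlf_aware` is hypothetical, and this is nowhere near a full implementation of the report: it keeps CR LF together as one break and treats a lone CR or a lone LF as a break of its own.

```rust
/// Hypothetical splitter: breaks after LF, after CR LF (kept together),
/// and after a CR that is not followed by LF. Terminators are not
/// included in the returned lines.
fn split_lines_crlf_aware(s: &str) -> Vec<&str> {
    // CR and LF are ASCII, so their byte positions are always valid
    // char boundaries and slicing at them is safe.
    let bytes = s.as_bytes();
    let mut lines = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < bytes.len() {
        let terminator_len = match bytes[i] {
            b'\r' if bytes.get(i + 1) == Some(&b'\n') => 2,
            b'\r' | b'\n' => 1,
            _ => 0,
        };
        if terminator_len > 0 {
            lines.push(&s[start..i]);
            i += terminator_len;
            start = i;
        } else {
            i += 1;
        }
    }
    // Keep a final line that has no terminator.
    if start < s.len() {
        lines.push(&s[start..]);
    }
    lines
}
```

Applied to \r\r\n\n, this returns three empty lines: one for the lone \r, one for \r\n, and one for the final \n.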