unicode rust newline carriage-return linefeed

How do I check if a character is a Unicode new-line character (not only ASCII) in Rust?


Every programming language has its own interpretation of \n and \r, and Unicode defines several characters that can represent a new line.

From the Rust reference:

A whitespace escape is one of the characters U+006E (n), U+0072 (r), or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or U+0009 (HT) respectively.

Based on that statement, I'd say a character in Rust is a new-line character if it is either \n or \r. On Windows, a new line might be the combination \r\n. I'm not sure, though.

What about the following?

  • Next line character (U+0085)
  • Line separator character (U+2028)
  • Paragraph separator character (U+2029)

In my opinion, we are missing something like a char.is_new_line(). I looked through the Unicode Character Categories but couldn't find a definition for new-lines.
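
For context, here is a minimal sketch of what the standard library's existing predicates report for these characters. The names `CANDIDATES` and `classify` are made up for illustration; `char::is_whitespace` and `char::is_control` are the real std methods. Neither singles out new-line characters: `is_whitespace` also accepts spaces and tabs, and `is_control` misses U+2028 and U+2029.

```rust
/// The candidates from the question, plus LF and CR (illustrative list).
const CANDIDATES: [char; 5] = ['\n', '\r', '\u{0085}', '\u{2028}', '\u{2029}'];

/// Hypothetical helper: returns `(is_whitespace, is_control)` for `c`.
fn classify(c: char) -> (bool, bool) {
    (c.is_whitespace(), c.is_control())
}
```

All five candidates are whitespace, but so is `' '`, and only `\n`, `\r` and U+0085 are control characters.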

Do I have to come up with my own definition of what a Unicode new-line character is?


Solution

  • Languages such as Java, Python, Go and JavaScript disagree considerably in practice about what constitutes a newline character and how that translates to "new lines". The disagreement shows in how their batteries-included regex engines match a pattern like $ against a string like \r\r\n\n in multi-line mode: are there two lines (\r\r\n, \n), three lines (\r, \r\n, \n, as Unicode says) or four (\r, \r, \n, \n, as JS sees it)? Go and Python do not treat \r\n as a single line ending for $, and neither does Rust's regex crate; Java's engine does, however. I don't know of any language whose batteries extend newline handling to any further Unicode characters.

    So the takeaway here is

    • It is agreed upon that \n is a newline
    • \r\n may be a single newline
    • unless \r\n is treated as two newlines
    • unless \r\n is "some character followed by a newline"
    • You shall not have any more newlines besides those.
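
The competing readings of \r\r\n\n can be made concrete with a small counter. This is only a sketch: `Convention` and `count_terminators` are hypothetical names, and the three variants are my paraphrase of the interpretations listed above, not any engine's actual implementation.

```rust
/// How a scanner might interpret CR and LF (hypothetical names).
#[derive(Clone, Copy)]
enum Convention {
    /// Only `\n` ends a line; a preceding `\r` is ordinary content.
    LfOnly,
    /// `\r\n` is one terminator; a lone `\r` or a lone `\n` is also one.
    CrLfAsOne,
    /// Every `\r` and every `\n` is its own terminator.
    EachSeparately,
}

/// Counts line terminators in `text` under the given convention.
fn count_terminators(text: &str, convention: Convention) -> usize {
    let mut count = 0;
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        match (convention, c) {
            // A line feed that reaches this point always ends a line.
            (_, '\n') => count += 1,
            // Under LfOnly nothing else matters, not even `\r`.
            (Convention::LfOnly, _) => {}
            (Convention::CrLfAsOne, '\r') => {
                // Swallow the LF of a CRLF pair so it is not counted twice.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                count += 1;
            }
            (Convention::EachSeparately, '\r') => count += 1,
            _ => {}
        }
    }
    count
}
```

For the input "\r\r\n\n", `LfOnly` counts 2, `CrLfAsOne` counts 3 and `EachSeparately` counts 4. Rust's own `str::lines` splits on \n and \r\n only, so for the same input it yields the two lines "\r" and "".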

    If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. Don't expect real-world input to rely on it, though. After all, we have had the ASCII record separator for a gazillion years, and everybody uses \t instead anyway.
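
Such a function could look roughly like the sketch below. `is_unicode_newline` is a hypothetical name, not a std method, and the set of characters is an assumption: it takes the three characters from the question and adds LF, CR, vertical tab and form feed, which as far as I can tell is the set that UAX #14 treats as mandatory breaks. Adjust it to whatever your input actually needs.

```rust
/// Hypothetical helper: true if `c` is one of the characters this sketch
/// chooses to treat as a new-line character.
fn is_unicode_newline(c: char) -> bool {
    matches!(
        c,
        '\n'           // U+000A LINE FEED
        | '\r'         // U+000D CARRIAGE RETURN
        | '\u{000B}'   // LINE TABULATION (vertical tab)
        | '\u{000C}'   // FORM FEED
        | '\u{0085}'   // NEXT LINE
        | '\u{2028}'   // LINE SEPARATOR
        | '\u{2029}'   // PARAGRAPH SEPARATOR
    )
}
```

Because a plain function taking a `char` works as a pattern, it can be passed straight to `str::split`, e.g. `"a\u{2028}b".split(is_unicode_newline)`. Note that this per-character check sees \r\n as two newlines with an empty piece in between.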

    Update: See http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules, section LB5, for why \r\r\n should be treated as two line breaks. You could read the whole page to get a grip on how your original question would have to be implemented. My guess is that by the time you reach "South East Asian: line breaks require morphological analysis", you'll close the tab :-)
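
The hard-break part of that report is at least mechanical. As I read it, a break is required after the mandatory-break characters, CR, LF and NEXT LINE, except that CR followed by LF is never split. A rough sketch under that reading follows; `split_hard_breaks` is a hypothetical name, and everything else in the report (soft break opportunities, tailoring) is deliberately ignored.

```rust
/// Hypothetical helper: splits `text` after every hard line break,
/// treating `\r\n` as a single break. Terminators are not included
/// in the returned pieces.
fn split_hard_breaks(text: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0;
    let mut iter = text.char_indices().peekable();
    while let Some((i, c)) = iter.next() {
        // Assumed hard-break set: LF, CR, VT, FF, NEL, LS, PS.
        let hard = matches!(
            c,
            '\n' | '\r' | '\u{000B}' | '\u{000C}' | '\u{0085}' | '\u{2028}' | '\u{2029}'
        );
        if !hard {
            continue;
        }
        lines.push(&text[start..i]);
        let mut end = i + c.len_utf8();
        if c == '\r' {
            // Never break between CR and LF: consume the LF as well.
            if let Some(&(next, '\n')) = iter.peek() {
                iter.next();
                end = next + 1;
            }
        }
        start = end;
    }
    // Keep a final line that has no terminator.
    if start < text.len() {
        lines.push(&text[start..]);
    }
    lines
}
```

With this sketch, "\r\r\n\n" comes out as three (empty) lines, and "a\r\nb\u{2028}c" as "a", "b", "c".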