python regex split newline backtracking

Split by double newline, prioritizing CRLF


The naive way to accomplish this would be:

import re
re.split(r'(?:\r\n|\r|\n){2}', '...')

But:

>>> re.split(r'(?:\r\n|\r|\n){2}', '\r\n\r\n\r\n')
['', '', '']

I'd like to get ['', '\r\n'] in this case. I probably need some sort of possessiveness, or some other way to keep the regex from backtracking into a CRLF. Is there a way?
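
To see where the extra split comes from, it may help to look at the match spans. This is only a small diagnostic sketch; the names NAIVE, text and spans are mine, not part of the original code:

```python
import re

NAIVE = r'(?:\r\n|\r|\n){2}'
text = '\r\n\r\n\r\n'

# Collect the (start, end) offsets of every match the naive pattern finds.
spans = [m.span() for m in re.finditer(NAIVE, text)]
print(spans)  # [(0, 4), (4, 6)]
```

The second match covers only the last CRLF: after `\r\n` fails to repeat, the engine backtracks and matches `\r` and `\n` as two separate newlines.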


Solution

  • You may restrict the \n and \r matching positions using lookarounds to avoid matching them when in a CRLF:

    r'(?:\r\n|\r(?!\n)|(?<!\r)\n){2}'
    

    Python test:

    >>> import re
    >>> re.split(r'(?:\r\n|\r(?!\n)|(?<!\r)\n){2}', '\r\n\r\n\r\n')
    ['', '\r\n']
    


    Details

    • (?:\r\n|\r(?!\n)|(?<!\r)\n){2} - a non-capturing group (if you use a capturing group instead, re.split will also put the value captured by the last iteration into the resulting list) that matches two repetitions of:
      • \r\n - a CRLF sequence
      • | - or
      • \r(?!\n) - a CR symbol not followed by an LF
      • | - or
      • (?<!\r)\n - an LF symbol not preceded by a CR.
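
As a rough check of the points above, the sketch below runs the naive pattern and the lookaround pattern side by side on a few mixed inputs, then shows what a capturing group does to the re.split output. The names NAIVE, STRICT, CAPTURING and the sample strings are illustrative assumptions, not part of the original answer:

```python
import re

NAIVE = r'(?:\r\n|\r|\n){2}'
STRICT = r'(?:\r\n|\r(?!\n)|(?<!\r)\n){2}'
# Same as STRICT, but with a capturing group (illustration only).
CAPTURING = r'(\r\n|\r(?!\n)|(?<!\r)\n){2}'

samples = ['a\r\nb', 'a\n\nb', 'a\r\rb', 'a\r\n\nb', 'a\r\n\r\n\r\nb']

# Map each sample to its (naive split, strict split) pair.
results = {s: (re.split(NAIVE, s), re.split(STRICT, s)) for s in samples}
for s, (naive, strict) in results.items():
    print(repr(s), naive, strict)

# A single CRLF is a "double newline" for NAIVE but not for STRICT:
# 'a\r\nb' -> ['a', 'b'] versus ['a\r\nb']

# With a capturing group, the last iteration's text lands in the output.
captured = re.split(CAPTURING, 'a\r\n\r\nb')
print(captured)  # ['a', '\r\n', 'b']
```

The single-CRLF case is the clearest one: without the lookarounds, one CRLF is read as CR plus LF and already counts as two newlines.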