
Odd behavior with negative lookbehind in Python


I am trying to do a re.split using a regex that is utilizing look-behinds. I want to split on newlines that aren't preceded by a \r. To complicate things, I also do NOT want to split on a \n if it's preceded by a certain substring: XYZ.

I can solve my problem by installing the regex module which lets me do variable width groups in my look behind. I'm trying to avoid installing anything, however.

My working regex, using the regex module, looks like:

regex.split("(?<!(?:\r|XYZ))\n", s)

And an example string:

s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"

Which when split would look like:

['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']

My closest non-working expression without the regex module:

re.split("(?<!(?:..\r|XYZ))\n", s)

But this split results in:

['DATA1', 'DA\r\n \r', ' \r', 'TA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']

And this I don't understand. From what I understand about look behinds, this last expression should work. Any idea how to accomplish this with the base re module?


Solution

  • You can use:

    >>> re.split(r"(?<!\r)(?<!XYZ)\n", s)
    ['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
    

    Here your single lookbehind is split into two separate assertions, both of which must hold at the position just before the \n:

    (?<!\r)  # previous char is not \r
    (?<!XYZ) # previous text is not XYZ
    

    Python's built-in re module won't accept (?<!(?:\r|XYZ)) because the two alternatives have different widths (1 character and 3 characters), so compiling it raises this error:

    error: look-behind requires fixed-width pattern
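
As a rough check of the points above, the sketch below (not part of the original answer) runs the three patterns against the sample string from the question, using only the standard re module. It also suggests why the padded ..\r attempt misbehaves: by default . does not match a newline, so ..\r cannot match the text "\n \r" that precedes the second and third \r\n, and the negative lookbehind lets the split happen there. The re.DOTALL variant and the short input string are illustrative assumptions added here, not something from the question.

```python
import re

s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"
expected = ['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']

VARIABLE = r"(?<!(?:\r|XYZ))\n"    # alternatives of width 1 and 3
PADDED = r"(?<!(?:..\r|XYZ))\n"    # both alternatives padded to width 3
STACKED = r"(?<!\r)(?<!XYZ)\n"     # two independent fixed-width lookbehinds

# 1. The re module refuses a lookbehind whose alternatives differ in width.
try:
    re.compile(VARIABLE)
    variable_width_error = None
except re.error as exc:
    variable_width_error = str(exc)

# 2. The padded pattern compiles, but "." does not match "\n" by default,
#    so "..\r" fails on "\n \r" and those newlines are split after all.
padded = re.split(PADDED, s)

# 3. Assumption for illustration: with re.DOTALL, "." also matches "\n",
#    and the padded pattern gives the expected result on this input.
padded_dotall = re.split(PADDED, s, flags=re.DOTALL)

# 4. The stacked lookbehinds need no padding and no flags.
stacked = re.split(STACKED, s)

# 5. Hypothetical short input: padding still breaks when fewer than two
#    characters precede the "\r", because "..\r" has nothing to consume.
short = "\r\nabc"
short_padded = re.split(PADDED, short, flags=re.DOTALL)
short_stacked = re.split(STACKED, short)

print(variable_width_error)
print(padded)
print(padded_dotall == expected, stacked == expected)
print(short_padded, short_stacked)
```

On this input the DOTALL variant and the stacked lookbehinds agree, but the padded form still splits a \r\n that sits within the first two characters of the string, so the stacked form looks like the safer choice.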