I am trying to do a re.split with a regex that uses lookbehinds. I want to split on newlines that aren't preceded by a \r. To complicate things, I also do NOT want to split on a \n if it's preceded by a certain substring: XYZ.
I can solve my problem by installing the regex module, which allows variable-width groups in a lookbehind. I'm trying to avoid installing anything, however.
My working regex looks like:
regex.split("(?<!(?:\r|XYZ))\n", s)
And an example string:
s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"
Which when split would look like:
['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
My closest non-working expression without the regex module:
re.split("(?<!(?:..\r|XYZ))\n", s)
But this split results in:
['DATA1', 'DA\r\n \r', ' \r', 'TA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
I don't understand this result. From what I know about lookbehinds, this last expression should work. Any idea how to accomplish this with the base re module?
You can use:
>>> re.split(r"(?<!\r)(?<!XYZ)\n", s)
['DATA1', 'DA\r\n \r\n \r\nTA2', 'DA\r\nTA3', 'DAXYZ\nTA4', 'DATA5']
Here the single lookbehind is broken into two consecutive assertions, each of fixed width, and both must hold at the position just before the \n:
(?<!\r) # previous char is not \r
(?<!XYZ) # previous text is not XYZ
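As a self-contained check, here is a minimal sketch of the same pattern wrapped in a helper. The names SPLIT_RE and split_records are hypothetical and only used for illustration; the two short inputs at the end are extra edge cases that are not from the question.

```python
import re

# Two fixed-width negative lookbehinds; both must hold just before the newline.
# SPLIT_RE and split_records are illustrative names, not part of the re module.
SPLIT_RE = re.compile(r"(?<!\r)(?<!XYZ)\n")

def split_records(text):
    # Split on LF unless it is preceded by CR or by the literal text XYZ.
    return SPLIT_RE.split(text)

s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"
parts = split_records(s)
print(parts)

# A CRLF at the very start of the string is left alone as well,
# because each lookbehind is evaluated independently.
print(split_records("\r\nabc"))
print(split_records("Z\nabc"))
```

Stacking the assertions works because each one is zero-width, so they are all tested at the same position and behave like a logical AND.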
Python's built-in re module won't accept (?<!(?:\r|XYZ)) because the two alternatives have different widths (one character versus three), and it reports this error:
error: look-behind requires fixed-width pattern
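As for why the padded attempt from the question misbehaves: by default "." does not match a newline, so the three characters "\n \r" are not recognised as "..\r" and the split still happens. The sketch below reproduces the error and then compares the padded pattern with and without re.DOTALL. The variable names are illustrative only.

```python
import re

s = "DATA1\nDA\r\n \r\n \r\nTA2\nDA\r\nTA3\nDAXYZ\nTA4\nDATA5"

# Alternatives of width 1 (CR) and width 3 (XYZ) are rejected at compile time.
try:
    re.compile(r"(?<!(?:\r|XYZ))\n")
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)

# Padding CR to width 3 compiles, but "." skips newlines by default,
# so a CR that follows "\n " is not seen and the split happens anyway.
padded = re.split(r"(?<!(?:..\r|XYZ))\n", s)
print(padded)

# With re.DOTALL, "." also matches newlines and the sample string splits as intended.
padded_dotall = re.split(r"(?<!(?:..\r|XYZ))\n", s, flags=re.DOTALL)
print(padded_dotall)

# Caveat: a width-3 lookbehind cannot match within the first two characters,
# so a CRLF at the very start of the string is still split.
print(re.split(r"(?<!(?:..\r|XYZ))\n", "\r\nabc", flags=re.DOTALL))
```

Because of that start-of-string caveat, the two separate lookbehinds remain the more robust choice; the re.DOTALL variant is shown only to explain the unexpected output.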