
Regex python - Match newline only if it is followed by number or special character and space


I've been trying to figure out this regex in Python but it's not been producing the expected result.

I load a text file whose contents are in this format:

"18 75 19\n!dont split here\n! but split here\n* and split here"

I'd like to get the following output:

['18 75 19\n!dont split here',
 '! but split here',
 '* and split here']

I'm trying to split my string by either 1) a new line followed by a number, or 2) a new line followed by a special character only if it is followed by a space (e.g. '! but split here', but not '!dont split here').

Here's what I have so far:

re.split(u'\n(?=[0-9]|([`\-=~!@#$%^&*()_+\[\]{};\'\\:"|<,./<>?])(?= ))', str)

This is close, but not there yet. Here's the output it produces:

['18 75 19\n!dont split here', '!', '! but split here', '*', '* and split here']

It incorrectly matches the special character separately: '!' and '*' have their own element. There are two lookahead operators in the regex.

I'd really appreciate it if you could help identify what I should change in this regex so that it no longer returns the single special character on its own, and only returns the special character together with the rest of its line.

I'm also open to alternatives. If there's a better way that doesn't involve two lookaheads, I'd also be interested to understand other ways to tackle this problem.

Thanks!


Solution

  • Your regex is actually working; the issue is the capturing group you have around [`\-=~!@#$%^&*()_+\[\]{};\'\\:"|<,./<>?]. From the manual:

    If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list

    If you remove the () around that character class, you will get the results you expect.

    Note that you don't need (?= ) in that alternation, because it is already inside a lookahead; a literal space is enough. You might also find it easier to write the symbols as a negated character class, i.e.

    re.split(u'\n(?=[0-9]|[^A-Za-z0-9] )', str)
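
To make the difference concrete, here is a minimal sketch that runs the original pattern and both suggested fixes on the sample string from the question. The variable names (`text`, `specials`, `with_group`, `without_group`, `negated`) are made up for this illustration, and the input is renamed from `str` only to avoid shadowing the built-in.

```python
import re

# Sample input from the question.
text = "18 75 19\n!dont split here\n! but split here\n* and split here"

# The asker's list of special characters, kept as a character class.
specials = r"""[`\-=~!@#$%^&*()_+\[\]{};'\\:"|<,./<>?]"""

# Original pattern: the capturing parentheses inside the lookahead make
# re.split() add the captured character to the result list.
with_group = re.split(r"\n(?=[0-9]|(" + specials + r")(?= ))", text)

# Fix 1: drop the capturing parentheses and the inner lookahead;
# a literal space inside the outer lookahead is enough.
without_group = re.split(r"\n(?=[0-9]|" + specials + r" )", text)

# Fix 2: the negated character class suggested in the answer.
negated = re.split(r"\n(?=[0-9]|[^A-Za-z0-9] )", text)

print(with_group)     # 5 items, including the stray '!' and '*'
print(without_group)  # 3 items
print(negated)        # 3 items
```

One caveat on the negated class: [^A-Za-z0-9] accepts every non-alphanumeric character, including whitespace, so it is broader than the explicit list. If the grouping is still wanted for readability, a non-capturing group, (?:...), should also keep the extra items out of the result.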