Search code examples
pythonregexpython-3.xbacktrackingregex-lookarounds

Prevent backtracking on regex to find non-comment lines (not starting with indented '#')


I'd like to search for lines that don't start with a pound sign (#) on indented code.

Currently, I'm using the regex ^\s*([^\s#].*) with multiline option on.

My problem is that on non commented lines it works perfectly.

On commented lines the regex engine performs a backtrack due to \s* all the way from the comment sign to the start of the line, which can sometimes cause 40 or 50 backtrack steps.

The regex works perfectly on python code. It's just not very efficient due to the backtracking caused by the engine.

Any idea as of how to avoid it?


Bonus: It's rather funny that the regex engine doesn't recognize the fact that it's searching for [^\s] one by one in \s* and causes this amount of backtracking. What are the challenges in making the re engine work so?

Bonus 2: Using only the stdlib re module. As I cannot add 3rd parties. (I'm technically searching using sublime text but want to know how to generally do it in Python)


Solution

  • Use atomic feature of lookarounds to avoid backtrack:

    ^(?=(\s*))\1([^#].*)
        ^^^^^  ^
    

    This usage is simplified in a negative lookahead which is proposed by @vks beautifully.

    or possessive quantifiers while using regex module:

    ^\s*+([^#].*)
    

    or even atomic groups:

    ^(?>\s*)([^#].*)
    

    Sublime Text supports all three since being on PCRE.

    and for bonus part, no it's not funny. If you be more eagle-eye on it you'll see it's not [^\s] which is literally equal to \S but it is a little bit different: [^\s#] which for engine means it has two different paths at each step to look for so it backtracks to reach one.