Search code examples
javascriptpythonregexbacktracking

Parse C-Style Comments with Regex, avoid Backtracking


I want to match all block and multiline comments in a JavaScript file (these are C-Style comments). I have a pattern that works well. However, it creates some backtracking which slows it down significantly, especially on larger files.

Pattern: \/\*(?:.|[\r\n])*?\*\/|(?:\/\/.*)

Example: https://www.regex101.com/r/pR6eH6/2

How can I avoid the backtracking?


Solution

  • You have heavy backtracking because of the alternation. Instead of the (?:.|[\r\n]), you may consider using a character class [\s\S] that boosts performance to a noticeable extent:

    \/\*[\s\S]*?\*\/|\/\/.*
    

    See demo

    In Python, you can use the re.S/re.DOTALL modifier to make . match line breaks, too (note that the single line comment pattern should be matched with \/\/[^\r\n]* then):

    /\*.*?\*/|//[^\r\n]*
    

    See another demo

    However, since *? lazy quantifier will also cause an overhead similar to the one caused by greedy quantifiers, you should consider using a much more optimal pattern for C style multiline comments - /\*[^*]*\*+(?:[^/*][^*]*\*+)*/, and the whole regex will now look like:

    /\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*
    

    See yet another demo

    Details:

    • /\* - a /*
    • [^*]* - zero or more chars other than *
    • \*+ - one or more asterisks
    • (?:[^/*][^*]*\*+)* - zero or more sequences of:
      • [^/*] - a symbol other than / and *
      • [^*]* - zero or more symbols other than *
      • \*+ - 1+ asterisks
    • / - a / symbol
    • | - or
    • //.* - // and any 0+ chars other than than line break chars.

    Just wanted to note that in Python, you do not need to escape / (in JS, you do not need to escape / when declaring a regex using the RegExp constuctor).

    NOTE: The last pattern does not allow simple capturing what is inside /* and */, but since the pattern is more stable than the rest, I'd advise using it even when you need to capture the contents with the trailing * - /\*([^*]*\*+(?:[^/*][^*]*\*+)*)/|//(.*) - and then you'd need to remove the last char from .group(1).