I want to match all block and multiline comments in a JavaScript file (these are C-Style comments). I have a pattern that works well. However, it creates some backtracking which slows it down significantly, especially on larger files.
Pattern: \/\*(?:.|[\r\n])*?\*\/|(?:\/\/.*)
Example: https://www.regex101.com/r/pR6eH6/2
How can I avoid the backtracking?
You have heavy backtracking because of the alternation. Instead of the (?:.|[\r\n])
, you may consider using a character class [\s\S]
that boosts performance to a noticeable extent:
\/\*[\s\S]*?\*\/|\/\/.*
See demo
In Python, you can use the re.S
/re.DOTALL
modifier to make .
match line breaks, too (note that the single line comment pattern should be matched with \/\/[^\r\n]*
then):
/\*.*?\*/|//[^\r\n]*
See another demo
However, since *?
lazy quantifier will also cause an overhead similar to the one caused by greedy quantifiers, you should consider using a much more optimal pattern for C style multiline comments - /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
, and the whole regex will now look like:
/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*
See yet another demo
Details:
/\*
- a /*
[^*]*
- zero or more chars other than *
\*+
- one or more asterisks(?:[^/*][^*]*\*+)*
- zero or more sequences of:
[^/*]
- a symbol other than /
and *
[^*]*
- zero or more symbols other than *
\*+
- 1+ asterisks/
- a /
symbol|
- or //.*
- //
and any 0+ chars other than than line break chars.Just wanted to note that in Python, you do not need to escape /
(in JS, you do not need to escape /
when declaring a regex using the RegExp constuctor).
NOTE: The last pattern does not allow simple capturing what is inside /*
and */
, but since the pattern is more stable than the rest, I'd advise using it even when you need to capture the contents with the trailing *
- /\*([^*]*\*+(?:[^/*][^*]*\*+)*)/|//(.*)
- and then you'd need to remove the last char from .group(1)
.