Search code examples
javascriptregexnode.jsv8

Javascript regexp infinite loop related to uppercase letters


I've created a regexp in node.js able to match parts of text that look like an URL. However, on some text, my regexp creates an infinite loop!

The whole regexp is:

/(((ftp|https?):\/\/|(www\.))?([a-z\d]+-?)+\.[a-z\d/\-._~:?#@!$&'*+,;=`]+)/gi

But after some debug, I found that the infinite loop is caused only by this part: /((ftp|https?):\/\/|(www\.))?([a-z\d]+-?)+\./gi

> let r = /((ftp|https?):\/\/|(www\.))?([a-z\d]+-?)+\./gi
> const txt = "www.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_~:/?#@!$&'*+,;=`"
> r.test(txt)
^CError: Script execution interrupted. // infinite loop

I've done more tests to understand the problem. First, without flags:

> r = /((ftp|https?):\/\/|(www\.))?([a-z\d]+-?)+\./
> r.test(txt)
true

Surprisingly, it works! Test 2 with only g flag:

> r = /((ftp|https?):\/\/|(www\.))?([a-z\d]+-?)+\./g
> r.test(txt)
true

Works too! Test 3 with only i flag:

> r = /((ftp|https?):\/\/|(www\.))?([a-z\d]+-?)+\./i
> r.test(txt)
^CError: Script execution interrupted. // infinite loop

Failure... So I decided to remove i flag and use A-Z instead:

> r = /((ftp|https?):\/\/|(www\.))?([a-zA-Z\d]+-?)+\./
> r.test(txt)
^CError: Script execution interrupted. // infinite loop

Not working... Does someone understand the problem ? Is there a solution ?


Solution

  • This is a catastrophic backtracking.

    To avoid this phenomenon, try to build a regex that would always progress from left to right (i.e. a failure to match would stop the regex, not make it try some sub part elsewhere). Said otherwise: there shouldn't be several ways to match a part of your string (or the regex engine will try them all, which can, in some case, lead to an exponentially sized tree of possibilities).

    Trying to understand your real goal, I can propose this solution:

    /((ftp|https?):\/\/|(www\.))?([a-z\d]+)(-[a-z\d]+)*\./gi
    

    (which is also probably more correct)

    Another solution would be to use a non backtracking regex engine such as R2 (yes there is a binding for node). In my experience R2 in node is slower than the standard regex engine so it only makes sense when you have user inputted regexes as you most often can find a non catastrophic regex when you're the regex author.