Search code examples
javascriptnode.jsregexcodeql

inefficient regular expression in javascript


Hi in our below code using codeql scanning got an alert that "This part of the regular expression may cause exponential backtracking on strings starting with '0' and containing many repetitions of '0'."

const validateUrl = str => {
  var pattern = new RegExp(
    "^(https?:\\/\\/)?" + // protocol
    "((([a-z\\d]([a-z\\d-]*[a-z\\d])*)\\.)+[a-z]{2,}|" + // domain name

please let me know how can we resolve this and what exactly the problem?


Solution

  • As the other answer has correctly identified, the problem is ([a-z\d-]*[a-z\d])*. This RegEx recognizes a superset of ([a-z\d]+)*. The nesting of the two quantifiers gives a naive backtracking RegEx implementation too many options to try: Consider a long string consisting of a's (or 0's, or any other character in [a-z\d-]) followed by $. The RegEx engine will first try matching a single a, then matching further according to the *. If this fails, it will later try matching two a's and so on. The same will be repeated afterwards. Ultimately your RegEx engine will have tried all options of partitioning the a's into substrings of length > 0, of which there are superexponentially many.

    Note that the enclosing (...\.)+ is not an issue, since the dot serves as a delimiter.

    Your first option, which wouldn't require changing the RegEx at all, would simply be to switch to a RegEx engine with guaranteed linear time complexity. This flaw is only exhibited by backtracking RegEx engines and not by RegEx engines which use DFAs (Deterministic Finite Automatons). There seem to be NPM packages which can convert RegEx to DFA.

    Your second option is to rewrite the RegEx. [a-z\d]([a-z\d-]*[a-z\d])* matches any sequence of alphanumerics and dashes as long as the dashes aren't at the end or beginning; thus we can convert the outer * quantifier into a ? quantifier: [a-z\d]([a-z\d-]*[a-z\d])?. Alternatively, we could also write it as [a-z\d]+(\-+[a-z\d]+)*. This nests quantifiers, but the dashes serve as a delimiter, so this shouldn't exhibit catastrophic backtracking either.

    You're not showing the full RegEx, but it is most likely insufficient to validate URLs. A proper, spec-compliant RegEx to validate URLs is rather lengthy. Take a look at the syntax diagram on Wikipedia; you're missing entire parts of it (like userinfo, port, query, fragment). Peter Seliger's comment is spot on:

    The OP might consider delegating such a validation task to an Web-Api URL compatible node implementation.