Search code examples
pythonregexurlip-addressipv4

Resolving Catastrophic Backtracking issue in RegEx


I am using RegEx for finding URL substrings in strings. The RegEx I am using has been taken from tohster's answer on - What's the cleanest way to extract URLs from a string using Python?

The RE is -

r'^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$'

I have done some changes to it -

  1. In the IPv4 detection part, I changed the order of the IP range to be found. > Precisely, changed [1-9]\d?|1\d\d|2[01]\d|22[0-3] to 25[0-5]|2[0-4][0-9]|1[0-> 9]{2}|[1-9][0-9]|[0-9] at 2 instances.
  2. Made the https group - (?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@) optional.

The final version is -

(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?

The final RE I am using seems to be very promising and has improved significantly as per my requirements(as compared to the original one) and works in Python as well as Java Script, except for the fact that due to the changes I have done have caused the following examples to give "catastrophic backtracking" error -

asasasasasac31.23.53.122asasassasd

12312312312321.32.34.2312312312321

12.3423423432.234123123.123

31.134232131.231.34

Can be tested at - https://regex101.com/r/i6jDei/1

My contention is that the first example - asasasasasac31.23.53.122asasassasd should have some slick way to pass as the IP is surrounded by non-numeric chars.

Also, is there a way to pass the first two of the above examples as valid IPv4 addresses?

To resolve ambiguity, I would opt for the largest possible Address, i.e.,

31.23.53.122

21.32.34.231


Solution

  • The issue of the catastrophic backtracking is caused by the pattern (?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})) where (?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+) will jump through a lot of combinations, if the overall pattern can not be matched. As you can see the character classes are basically the same, so e.g. for asasasasasac31 it can match like:

    (asasasasasac31)
    (a)(sasasasasac31)
    (a)(s)(asasasasac31)
    (as)(asasasasac31)
    

    This is not really the way it actually takes, just to show how many combinations exist.

    The mistake here seems to be the - being optional which I see no reason for. If we remove the -, we get it working for your samples (and reduce the number of steps for the already working samples).

    See the updated regex101-demo, where I also added your samples that caused the catastrophic backtracking.

    The final pattern then is:

    (?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])|(?:(?:[a-z\u00a1-\uffff0-9]+-)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?