Search code examples
pythonregexpython-2.7backtracking

Prevent Catastrophic Backtracking in Regex


I have a code to scrape a million websites and detect contact info from their homepage.

For some reasons, when I run code, it gets stuck and does not proceed after crawling about 60k requests, I am marking the website URLs in my DB as status=done

I have run code several times but it gets stuck around 60k requests.

It doesnt get stuck on a certain website.

Here is Regex I am using

    emails = re.findall('[\w\.-]+@[\w-]+\.[\w\.-]+', lc_body)
    mobiles = re.findall(r"(\(?(?<!\d)\d{3}\)?-? *\d{3}-? *-?\d{4})(?!\d)|(?<!\d)(\+\d{11})(?!\d)", lc_body)
    abns = re.findall('[a][-\.\s]??[b][-\.\s]??[n][-\:\.\s]?[\:\.\s]?(\d+[\s\-\.]?\d+[\s\-\.]?\d+[\s\-\.]?\d+)', lc_body)

    licences = re.findall(r"(Licence|Lic|License|Licence)\s*(\w*)(\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+)(\d+)", lc_body, re.IGNORECASE)

My thought is licences's regex is causing issues, how can I simplify it? How can I remove Backtracking ?

I want to find all Licence numbers possible.

It can be License No: 2543 , License: 2543, License # 2543, License #2543, License# 2543 and many other combinations as well.


Solution

  • The issue is caused with the third group: (\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+) - all alternatives start with \s* here. This causes lots of redundant backtracking as these alternatives can match at the same location in a string. The best practice is to use alternatives in an alternation group that do not match at the same location.

    Now, looking at the strings you need to match, I suggest using

    Lic(?:en[cs]e)?(?:\W*No:)?\W*\d+
    

    See the regex demo

    Make the pattern more specific and linear, get rid of as many alternations as possible, use optional non-capturing groups and character classes.

    Details:

    • Lic(?:en[cs]e)? - Lic followed with 1 or 0 occurrences (the (?:...)? is an optional non-capturing group since ? quantifier matches 1 or 0 occurrences of the quantified subpatterns) of ence or ense (the character class [sc] matches either s or c and is much more efficient than (s|c))
    • (?:\W*No:)? - a non-capturing group that matches 1 or 0 occurrences of 0+ non-word chars (with \W*) followed with No: substring
    • \W*
    • \d+ - 1 or more digits.