I have a code to scrape a million websites and detect contact info from their homepage.
For some reasons, when I run code, it gets stuck and does not proceed after crawling about 60k requests, I am marking the website URLs in my DB as status=done
I have run code several times but it gets stuck around 60k requests.
It doesnt get stuck on a certain website.
Here is Regex I am using
emails = re.findall('[\w\.-]+@[\w-]+\.[\w\.-]+', lc_body)
mobiles = re.findall(r"(\(?(?<!\d)\d{3}\)?-? *\d{3}-? *-?\d{4})(?!\d)|(?<!\d)(\+\d{11})(?!\d)", lc_body)
abns = re.findall('[a][-\.\s]??[b][-\.\s]??[n][-\:\.\s]?[\:\.\s]?(\d+[\s\-\.]?\d+[\s\-\.]?\d+[\s\-\.]?\d+)', lc_body)
licences = re.findall(r"(Licence|Lic|License|Licence)\s*(\w*)(\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+)(\d+)", lc_body, re.IGNORECASE)
My thought is licences
's regex is causing issues, how can I simplify it? How can I remove Backtracking ?
I want to find all Licence numbers possible.
It can be License No: 2543
, License: 2543
, License # 2543
, License #2543
, License# 2543
and many other combinations as well.
The issue is caused with the third group: (\s*|\s*#\s*|\s*.\s*|\s*-\s*|\s*:\s+)
- all alternatives start with \s*
here. This causes lots of redundant backtracking as these alternatives can match at the same location in a string. The best practice is to use alternatives in an alternation group that do not match at the same location.
Now, looking at the strings you need to match, I suggest using
Lic(?:en[cs]e)?(?:\W*No:)?\W*\d+
See the regex demo
Make the pattern more specific and linear, get rid of as many alternations as possible, use optional non-capturing groups and character classes.
Details:
Lic(?:en[cs]e)?
- Lic
followed with 1 or 0 occurrences (the (?:...)?
is an optional non-capturing group since ?
quantifier matches 1 or 0 occurrences of the quantified subpatterns) of ence
or ense
(the character class [sc]
matches either s
or c
and is much more efficient than (s|c)
)(?:\W*No:)?
- a non-capturing group that matches 1 or 0 occurrences of 0+ non-word chars (with \W*
) followed with No:
substring\W*
\d+
- 1 or more digits.