python regex regex-negation

Speed up regexp to ban wordlist from urls


I'm working on a regular expression that is meant to ban local websites that contain certain words in the URL. The sites have the form http://mysite.si or https://mysite.si, with the banned word potentially appearing before the '.si' or after it (in the path). I'm doing this because my content filter is not very good at blocking local websites that I don't want my kids exposed to. So far I've come up with the following:

(?!.*(word1|word2|word3...|wordx))(https|http)://.*[.]si

where wordx represents a banned word. The pattern filters out what I want it to, but it is too slow (the wordlist consists of 400 words), and I would appreciate any suggestions for improving performance.


Solution

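  • You might make the pattern perform slightly better by shortening the alternation (https|http) to https?:// and matching the protocol first, with the negative lookahead placed after it. That way the lookahead, which is the expensive part, only runs at positions where a protocol has already matched.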

    For matching the rest of the string, you can change the .* to \S* to match only non-whitespace characters, provided the URLs cannot contain spaces.

    If you know which words occur more often than others, you can put the more frequent ones at the beginning of the alternation, and make the quantifier non-greedy (\S*?) so that the assertion can reach a result sooner.

    To prevent a partial match, you could add word boundaries \b around the pattern.

    Depending on the word list, you might also add word boundaries around the group, as in \b(?:word1|word2|word3)\b, so that a banned word is only found as a whole word and not inside a longer one.

    \bhttps?://(?!\S*?(?:word1|word2|word3...|wordx))\S*[.]si\b
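    As a rough illustration, the pattern above could be built and used from Python roughly as follows. This is only a sketch: the three words in BANNED_WORDS are hypothetical stand-ins for the real 400-word list, and the names build_pattern, is_allowed and the whole_words flag are made up for this example. A URL is treated as allowed when the pattern matches, and as banned when the negative lookahead rejects it.

```python
import re

# Hypothetical stand-ins for the real 400-word list; ideally ordered
# with the most frequently seen words first.
BANNED_WORDS = ["casino", "poker", "bet"]


def build_pattern(words, whole_words=False):
    """Compile the 'protocol first, lookahead second' pattern."""
    # re.escape keeps any regex metacharacters in a word literal.
    group = "(?:" + "|".join(re.escape(w) for w in words) + ")"
    if whole_words:
        # Optional: only treat a banned word as a hit when it stands alone.
        group = r"\b" + group + r"\b"
    return re.compile(r"\bhttps?://(?!\S*?" + group + r")\S*[.]si\b")


# Compile once and reuse; recompiling for every URL would waste time.
ALLOWED_URL = build_pattern(BANNED_WORDS)
ALLOWED_URL_WHOLE = build_pattern(BANNED_WORDS, whole_words=True)


def is_allowed(url, pattern=ALLOWED_URL):
    """Return True if the URL matches, i.e. contains no banned word."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in ("https://news.si", "http://casino.si",
                "https://news.si/poker", "https://alphabet.si"):
        print(url, is_allowed(url), is_allowed(url, ALLOWED_URL_WHOLE))
```

    Note that without word boundaries around the group, a short word such as "bet" also rejects harmless URLs like https://alphabet.si, which is why the choice depends on the word list.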