
Regex negative lookahead true then ignore rest of regex


I'm using the following IIS Rewrite Rule to block as many bots as possible.

<rule name="BotBlock" stopProcessing="true">
  <match url=".*" />
  <conditions>
    <add input="{HTTP_USER_AGENT}" pattern="^$|\b(?!.*googlebot.*\b)\w*(?:bot|crawl|spider)\w*" />
  </conditions>
  <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" />
</rule>

The goal is to block every user agent that contains bot, crawl or spider, but still allow Googlebot. This works to an extent. The problem is that the second part of the regex still triggers even when "googlebot" is found in the string.

Below are some examples of what I mean:

 Googlebot/2.1 (+http://www.google.com)

Works fine, the 'bot' part in googlebot is ignored and the request is permitted.

 Googlebot/2.1 (+http://www.google.com/bot.html)

Does not work: the regex still triggers on the second 'bot' in the string and the request is blocked.

 KHTML, like Gecko; compatible; bingbot

Works fine, is triggered on the bot in bingbot and the request is blocked

So can someone help me change the regex so that the string Googlebot/2.1 (+http://www.google.com/bot.html) is allowed?


Solution

  • I'm not familiar with IIS's exact regex flavor (presumably .NET), but this should work if you can enable case-insensitive matching:

    ^(?!.*googlebot).*(?:bot|crawl|spider)
    

    Explanation:

    • ^ - start-of-line anchor
    • (?!.*googlebot) - negative lookahead: from this position, "googlebot" must not appear anywhere later in the string
    • .*(?:bot|crawl|spider) - match everything up to and including "bot", "crawl", or "spider"

    Because the lookahead is anchored at the start of the string, it checks the whole user agent once. Combining the negative lookahead with the forward match produces an implicit AND condition: both must be true for the regex to register a match.

    https://regex101.com/r/ri6Qs7/1


    To note: I am not sure why your regex starts with ^$| unless you are purposely looking to provide a 403 to requests with an empty user agent.
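
As a quick sanity check, both patterns can be run against the three user agents from the question. This is only a minimal sketch in Python, assuming that Python's `re` module handles the lookahead and word-boundary constructs used here the same way IIS's engine does, and that the rule condition matches anywhere in the input unless anchored. The names `ORIGINAL`, `SUGGESTED`, `USER_AGENTS` and `is_blocked` are made up for illustration.

```python
import re

# Pattern from the question, including its empty-user-agent alternative.
ORIGINAL = re.compile(
    r"^$|\b(?!.*googlebot.*\b)\w*(?:bot|crawl|spider)\w*", re.IGNORECASE
)

# Pattern suggested in the answer.
SUGGESTED = re.compile(r"^(?!.*googlebot).*(?:bot|crawl|spider)", re.IGNORECASE)

# The three user agents quoted in the question.
USER_AGENTS = [
    "Googlebot/2.1 (+http://www.google.com)",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "KHTML, like Gecko; compatible; bingbot",
]


def is_blocked(pattern, user_agent):
    """Return True when the pattern matches, i.e. the rule would fire."""
    return pattern.search(user_agent) is not None


for ua in USER_AGENTS:
    print(
        f"{ua!r}: original={is_blocked(ORIGINAL, ua)}, "
        f"suggested={is_blocked(SUGGESTED, ua)}"
    )
```

The original pattern re-evaluates its lookahead at every word boundary, so it can still match at the later "bot" in "/bot.html", where no "googlebot" remains ahead. If empty user agents should keep getting a 403, prepending `^$|` to the suggested pattern should preserve that behavior.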