How to disallow robots in .htaccess and robots.txt?


I tried to block Amazonbot from my website by adding these lines to robots.txt:

User-agent: Amazonbot
Disallow: /

After several hours I noticed that the robot was not following robots.txt, so I added the following lines to .htaccess:

# Block Harmful Robot
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (Amazonbot) [NC]
    RewriteRule (.*) - [F,L]
</IfModule>
# END Block Harmful Robot

Still, I see this robot in my website statistics report. Is there any other way to block this robot?


Solution

  • For friendly bots that follow the Robots Exclusion Protocol (RFC 9309), blocking them in robots.txt should be enough. Once a bot is blocked, it can take some time for the change to take effect, because web crawlers usually cache robots.txt files. The Robots Exclusion Protocol says crawlers should not use a cached copy for more than 24 hours (https://datatracker.ietf.org/doc/html/rfc9309#name-caching), unless the robots.txt file is unreachable, in which case the cached copy may be used for longer.

    Another possibility is that the observed bot is only pretending to be Amazonbot and is actually a bot that does not adhere to the Robots Exclusion Protocol. You can verify the bot with a reverse DNS lookup on the requesting IP address, followed by a forward DNS lookup on the returned hostname, as described on the Amazonbot page.

    If you are using Apache 2.4 with mod_authz_host, you can combine a User-Agent check with the following directive to allow only the verified Amazonbot and block bots that merely claim to be it:

    Require host crawl.amazonbot.amazon
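
    One way the User-Agent check and the Require directive might be combined is sketched below. It is an untested sketch that assumes Apache 2.4 with mod_authz_core and mod_authz_host loaded, and that your host allows <If> sections and Require in .htaccess (AllowOverride must permit them). Wrapping the directive in an <If> section is my assumption about how to do the combining, not something the Amazonbot page prescribes.

    ```apache
    # Applies only to requests whose User-Agent contains "Amazonbot"
    # (case-insensitive); all other clients are unaffected by this section.
    <If "%{HTTP_USER_AGENT} =~ /Amazonbot/i">
        # Require host performs a reverse DNS lookup on the client IP and then
        # a forward lookup on the resulting hostname. Requests that do not
        # resolve to a host under crawl.amazonbot.amazon are refused with 403.
        Require host crawl.amazonbot.amazon
    </If>
    ```

    With this in place, a client that only claims to be Amazonbot should be refused, while the genuine crawler is still let through and can then be kept out by robots.txt. If the goal is to refuse every request that claims to be Amazonbot, verified or not, replacing the Require line with `Require all denied` should have that effect. Note that DNS-based checks add a lookup to each matching request.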